The researchers analyzed 700 million words and phrases collected from
the Facebook messages of 75,000 volunteers, and found fascinating differences
between the posts of introverts and extroverts, men and women, etc.

Would love to hear your thoughts about the study! For me, the
findings weren’t always substantively surprising, but
seeing them presented as word clouds gave me a whole new view into
human nature.

Personality, Gender, and Age in the Language of Social Media: The Open-Vocabulary Approach

Introduction

The social sciences have entered
the age of data science, leveraging the unprecedented sources of written
language that social media afford[1]–[3]. Through media such as Facebook
and Twitter, used regularly by more than 1/7th of the world's population[4], variation in mood has been
tracked diurnally and across seasons[5], used to predict the stock
market[6], and leveraged to estimate
happiness across time[7],[8]. Search patterns on Google
detect influenza epidemics weeks before CDC data confirm them[9], and the digitization of books
makes possible the quantitative tracking of cultural trends over decades[10]. To make sense of the massive
data available, multidisciplinary collaborations between fields such as
computational linguistics and the social sciences are needed. Here, we
demonstrate an instrument which uniquely describes similarities and differences
among groups of people in terms of their differential language use.

Our technique leverages what
people say in social media to find distinctive words, phrases,
and topics as functions of known attributes of
people such as gender, age, location, or psychological characteristics. The
standard approach to correlating language use with individual attributes is to
examine usage of a priori fixed sets of words[11], limiting findings to
preconceived relationships with words or categories. In contrast, we extract a
data-driven collection of words, phrases,
and topics, in which the lexicon is based on the
words of the text being analyzed. This yields a comprehensive description of
the differences between groups of people for any given attribute, and allows
one to find unexpected results. We call approaches like ours, which do not rely
on a priori word
or category judgments, open-vocabulary analyses.

We use differential
language analysis (DLA), our particular method
of open-vocabulary analysis, to find language features across millions of
Facebook messages that distinguish demographic and psychological attributes.
From a dataset of over 15.4 million Facebook messages collected from 75
thousand volunteers[12], we extract 700 million
instances of words, phrases,
and automatically generated topics and correlate them with gender, age,
and personality. We replicate traditional language analyses by applying
Linguistic Inquiry and Word Count (LIWC)[11], a popular tool in psychology,
to our data set. Then, we show that open-vocabulary analyses can
yield additional insights (correlations between personality and
behavior as manifest through language) and more information (as measured through predictive
accuracy) than traditional a priori word-category approaches. We present a
word cloud-based technique to visualize results of DLA.
Our large set of correlations is made available for others to use (available
at: http://www.wwbp.org/).
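
To make the open-vocabulary idea concrete, here is a minimal Python sketch of correlating word use with a known attribute. It is an illustration under stated assumptions, not the study's actual pipeline: it handles single words only (no phrases or topics), uses a plain Pearson correlation, and runs on made-up toy data. The function names (`relative_frequencies`, `pearson`, `open_vocabulary_correlations`) are hypothetical.

```python
import math
import re
from collections import Counter


def relative_frequencies(messages):
    """Return each word's share of all word tokens in one user's messages."""
    tokens = [t for m in messages for t in re.findall(r"[a-z']+", m.lower())]
    counts = Counter(tokens)
    total = len(tokens)
    return {w: c / total for w, c in counts.items()} if total else {}


def pearson(xs, ys):
    """Plain Pearson correlation; returns 0.0 when either side is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def open_vocabulary_correlations(users):
    """users: list of (messages, attribute_value) pairs.

    The lexicon is built from the data itself, not from a fixed word list
    (this is the assumption that makes the sketch "open-vocabulary").
    """
    freqs = [relative_frequencies(msgs) for msgs, _ in users]
    attribute = [value for _, value in users]
    lexicon = set().union(*freqs)
    return {
        word: pearson([f.get(word, 0.0) for f in freqs], attribute)
        for word in lexicon
    }


# Toy, made-up data: (messages, age) per user.
toy_users = [
    (["homework is due tomorrow", "so much homework"], 16),
    (["homework again", "school was fine"], 17),
    (["meeting at work tomorrow"], 35),
    (["work was long", "my daughter has school"], 42),
]
scores = open_vocabulary_correlations(toy_users)
```

In this toy data, `scores["homework"]` comes out negative (used more by the younger users) and `scores["work"]` positive. A closed-vocabulary analysis would instead start from predefined word categories and so could only report on words someone had thought to include in advance.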