Mining the Blogosphere: Age, gender and the varieties of self-expression

The growth of the blogosphere offers an unprecedented opportunity to study language and how people use it on a large scale. We present an analysis of over 140 million words of English text drawn from the blogosphere, exploring if and how age and gender affect writing style and topic. Our primary result is that a number of stylistic and content-based indicators are significantly affected by both age and gender, and that the main difference between older and younger bloggers, and between male and female bloggers, lies in the extent to which their discourse is outer- or inner-directed. In fact, the linguistic factors that increase in use with age are just those used more by males of any age, and conversely, those that decrease in use with age are those used more by females of any age.

Introduction

A great deal of research has been carried out over the last few decades on how different groups of people use language differently (see, e.g., Labov, 1972; Biber and Finegan, 1994; Schneider, 2002). This research has often been constrained, however, by the time and expense needed to collect and annotate data. Studies therefore often have had to make do with comparatively small sample sizes, which makes it tricky to determine how general any results actually are.

The growth of the blogosphere, however, provides an interesting way out of this conundrum. Anyone can write a blog, and blogs are written about anything the blogger wishes and in whatever style they wish, typically with no editorial control. Moreover, blogs are electronically available for downloading, so that data collection is greatly eased. Since there are many millions of such blogs, the blogosphere offers an unprecedented opportunity to study, in a natural context and over a vast scale, how different groups of people write.

We report here our analysis of a large corpus of blog postings to see if and how writing topic and style vary with age and gender of the blogger. There has been much research interest in possible differences between male and female language use (Coates, 1986; Labov, 1990; Holmes, 1997; Bergvall, 1999), some of which has raised great interest in the popular literature (e.g., Tannen, 2001). It has also recently been shown that writing topic and style are useful indicators of age-linked psychological developments in personality, interests, and feelings (Pennebaker, et al., 2003; Pennebaker and Stone, 2003). As we have noted, however, previous studies have generally been limited by the difficulty of data gathering, and so have relied on relatively small amounts of text (cf. Bailey and Dyer, 1992; Biber, 1993; Schneider, 2002), often gathered in artificial laboratory settings.

Our corpus comprises over 140 million words of naturally occurring text from randomly selected blogs by men and women from their teens into their forties. By applying factor analysis and machine learning techniques, we demonstrate here clear and consistent patterns of age- and gender-linked variation in writing topic and style. We find that older bloggers tend to write about externally-focused topics, while younger bloggers tend to write about more personally-focused topics; changes in writing style with age are closely related. Perhaps surprisingly, similar patterns also characterize gender-linked differences in language style. In fact, the linguistic factors that increase in use with age are just those used more by males of any age, and conversely, those that decrease in use with age are those used more by females of any age. Our results thus confirm and generalize earlier results on age-linked (Pennebaker, et al., 2003; Pennebaker and Stone, 2003; Burger and Henderson, 2006) and gender-linked (Mulac and Lundell, 1994; Biber, 1994; Argamon, et al., 2003; Newman, et al., in press) variation in language use. We suggest that our results are best explained by positing a single factor distinguishing internal from external psychological focus that underlies both age- and gender-linked variation in language use. Preliminary results along these lines were previously presented by the authors in (Schler, et al., 2006).

Previous work on gender and age effects on the blogosphere has generally been of comparatively small scale. Herring, et al. (2004) have considered several blog genres, particularly the distinction between personal journal type blogs and filter type blogs (which collect and filter information and links). They have noted that most filter blogs are written by male bloggers and by older bloggers. Similarly, Nowson, et al. (2006) found a strong effect of author sex on blog language, finding that female-authored blogs were more contextualized (as measured by Heylighen and Dewaele’s (2002) F measure) than male-authored blogs. In this vein, Huffaker and Calvert (2005) found that teen bloggers are particularly likely to use blogging as a forum for exploring personal issues such as sexual identity. With a few exceptions (e.g., Herring, et al., 2004; Burger and Henderson, 2006), there has been little work on age in the blogosphere.

Some work on computermediated communication (CMC) other than blogs (i.e., discussion groups and email) has applied discourse and content analysis to relevant issues of genderlinked language. Thus, for example, it has been found that maledominated discussion groups had more statements of fact and fewer selfdisclosures (Savicki, et al., 1996), that women had higher rates of using emoticons in their messages (Witmer, 1996), and that email messages about vacations written by females mentioned more about social aspects and shopping while males focused more impersonally on the location, the journey, and local people (Colley and Todd, 2002).

Our research extends this previous work to the automated analysis of a much larger corpus of texts than those previously analyzed for such sociolinguistic variation (compare, e.g., Labov, 1990; Bailey and Dyer, 1992; Mulac and Lundell, 1994; Herring and Paolillo, 2006). This has the effect of minimizing possible sample biases, which is critical when dealing with over a thousand textual variables, as we do here. Moreover, as far as we are aware, our current study is the first to examine the relationship between how language use varies by age and how it varies by gender.

Corpus design

We gathered a collection of blogs from the Web site blogger.com in August 2004. We collected all blogs on the site which (a) contained at least 500 total words including at least 200 occurrences of common English words, and (b) had author-provided indication of both gender and age. We then randomly selected 10 percent of the documents as a holdout set (for purposes described below). This left an initial collection of 46,947 blogs, summarized in Table 1 (our unit of analysis throughout this paper is each blogger's collected writing from inception until harvest; we do not distinguish between different posts by a given blogger). Note that over 60 percent of bloggers age 17 and below are females, while over 60 percent of bloggers older than 17 are males.

Table 1: Distribution of blogs in our initial collection by age and gender.

Age             Female     Male    Total
13-17             6949     4120    11069
18-22             7393     7690    15083
23-27             4043     6062    10105
28-32             1686     3057     4743
33-37              860     1827     2687
38-42              374      819     1193
43-47              263      584      847
48 and older       314      906     1220
Total            21882    25065    46947

For purposes of analysis, formatting and non-English text were automatically removed from each blog. To enable reliable age categorization (since a blog can span several years of writing), all blogs for boundary ages (ages 18-22 and 28-32) were removed. Each blogger was categorized by age at time of harvest: 10s (ages 13-17), 20s (ages 23-27) and 30+ (ages 33-47), and also by gender: male and female. The number of blogs of each gender within each age category was equalized by randomly deleting surplus blogs from the larger gender category. The final corpus thus contained 19,320 blogs (8,240 in 10s, 8,086 in 20s, and 2,994 in 30+), comprising a total of 681,288 posts and over 140 million words. There were, on average, approximately 35 posts and 7,300 words in each blog in the corpus.
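The balancing step can be illustrated with a short sketch. It assumes each blog is a record with a gender field; the function name, the record layout, and the random seed are illustrative choices rather than details of our pipeline. The counts are those of Table 1 for the three retained age categories.

```python
import random

def balance_by_gender(blogs, seed=0):
    """Equalize gender counts by randomly dropping surplus blogs from the larger group."""
    rng = random.Random(seed)
    female = [b for b in blogs if b["gender"] == "female"]
    male = [b for b in blogs if b["gender"] == "male"]
    n = min(len(female), len(male))
    # Sampling n from each group is equivalent to deleting the surplus at random.
    return rng.sample(female, n) + rng.sample(male, n)

# Table 1 counts for the age brackets kept after removing the boundary ages.
counts = {
    "10s": {"female": 6949, "male": 4120},                          # ages 13-17
    "20s": {"female": 4043, "male": 6062},                          # ages 23-27
    "30+": {"female": 860 + 374 + 263, "male": 1827 + 819 + 584},   # ages 33-47
}

balanced_sizes = {}
for category, by_gender in counts.items():
    # Placeholder records stand in for real blogs (hypothetical layout).
    blogs = [{"gender": g, "id": i} for g, n in by_gender.items() for i in range(n)]
    balanced_sizes[category] = len(balance_by_gender(blogs))

print(balanced_sizes)                # {'10s': 8240, '20s': 8086, '30+': 2994}
print(sum(balanced_sizes.values()))  # 19320
```

The resulting sizes reproduce the per-category and total figures reported above.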

Factor analysis

We begin by considering the 1000 most frequent words in the corpus. These comprise 323 different function words and 677 different content words, accounting for 59.4 percent and 21.7 percent, respectively, of all word occurrences. We performed an automated factor analysis on the rate of use of each of the 677 content words, to find groups of related words that tend to occur in similar documents. This process, referred to as a meaning extraction method (Chung and Pennebaker, 2007), yielded twenty coherent factors that depict clear and distinct themes, mostly topic-related. Word lists for the twenty factors, along with suggestive headings (for reference), are given in Table 2. In addition, we divided the function words into several categories according to their parts of speech (pronouns, auxiliary verbs, etc.).
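To make the input to such an analysis concrete, the sketch below computes per-document usage rates for a small vocabulary and the word-by-word Pearson correlations across documents. Words that correlate strongly are the ones a factor analysis would tend to group together; the factor extraction itself is omitted. The toy documents, the vocabulary, and the helper names are assumptions made purely for illustration.

```python
from collections import Counter
from math import sqrt

def usage_rates(tokens, vocabulary):
    """Occurrences per 100 tokens of each vocabulary word in one document."""
    counts = Counter(tokens)
    return [100.0 * counts[w] / len(tokens) for w in vocabulary]

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy documents (hypothetical), already tokenized.
docs = [
    "mom dad home dinner mom home".split(),
    "vote election president vote party".split(),
    "dad home mom dinner kitchen home".split(),
    "president party election campaign vote".split(),
]
vocabulary = ["mom", "home", "vote", "election"]

rates = [usage_rates(d, vocabulary) for d in docs]   # one row per document
columns = list(zip(*rates))                          # one column per word
corr = {
    (a, b): pearson(columns[i], columns[j])
    for i, a in enumerate(vocabulary)
    for j, b in enumerate(vocabulary)
}

for pair in [("mom", "home"), ("vote", "election"), ("mom", "vote")]:
    print(pair, round(corr[pair], 2))
```

In this toy data the family-related words correlate positively with each other and negatively with the politics-related words, which is the kind of structure a factor analysis summarizes.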

Agelinked variation

Table 3 shows the mean frequency of each factor in each age and gender class, along with the same data for function words grouped by part of speech.

Table 3: Mean frequencies of factor and part-of-speech usage by age and gender.

Factor                10s      20s     30s+     Male   Female  Overall
Conversation         1.74     1.55     1.33     1.47     1.72     1.59
AtHome               1.11      .80      .75      .86      .98      .92
Family                .65      .75      .94      .69      .79      .74
Time                  .65      .74      .68      .65      .73      .69
PastActions           .74      .62      .63      .62      .73      .68
Work                  .61      .75      .70      .67      .69      .68
Games                 .67      .66      .66      .76      .57      .67
Internet              .61      .63      .68      .74      .52      .63
Location              .52      .65      .63      .60      .58      .59
Fun                   .88      .36      .28      .50      .64      .57
Food/Clothes          .53      .55      .55      .49      .60      .54
Poetic                .52      .53      .52      .48      .57      .53
Books/Movies          .51      .54      .54      .54      .51      .53
Religion              .44      .50      .55      .50      .46      .48
Romance               .54      .44      .38      .39      .55      .47
Swearing              .54      .35      .25      .41      .42      .41
Politics              .27      .41      .56      .47      .28      .37
Music                 .36      .29      .26      .34      .29      .32
School                .35      .19      .17      .26      .25      .26
Business              .07      .13      .16      .13      .08      .11
Articles             5.10     6.46     6.97     6.46     5.45     5.96
PersonalPronouns    11.72    10.44     9.88     9.84    11.97    10.96
AuxiliaryVerbs       9.04     8.90     8.83     8.76     9.14     8.95
Conjunctions         2.89     2.59     2.48     2.63     2.76     2.70
Prepositions        11.83    13.04    13.30    12.76    12.36    12.56

First of all, these results indicate clear differences in both preferred topic and preferred style between bloggers of different ages [3]. Usage of words associated with Family, Religion, Politics, Business, and Internet increases with age, while usage of words associated with Conversation, AtHome, Fun, Romance, Music, School, and Swearing decreases significantly with age. (All effects mentioned are statistically significant with p < 0.001.) None of the other factors varies directly with age in a statistically significant fashion. In addition to these topic-related differences in blogs with blogger age, we also see clear differences in style, as measured by frequencies of grammatical parts of speech. Usage of PersonalPronouns, Conjunctions, and AuxiliaryVerbs decreases significantly with age, while usage of Articles and Prepositions increases significantly with age.

In fact, such variations in word frequency can be exploited to effectively predict the age of a blog's writer. To show this, we computed, for each blog, a vector containing the frequencies in the blog of the above-mentioned 323 function words as well as the 1000 most informative words [4] for age. Two different machine-learning algorithms, Bayesian multinomial logistic regression (BMR: Madigan, et al., 2005) and multi-class balanced real-valued Winnow (WIN: Littlestone, 1988; Dagan, et al., 1997), were applied to these frequency vectors to construct classification models for author age. Ten-fold cross-validation [5] was used to estimate generalization accuracy. The results show automatic classification of an unseen document into the correct age interval (10s, 20s, or 30+) with an accuracy of 77.4 percent (using BMR) and 75.0 percent (using WIN). Examination of the confusion matrix shows that 10s are distinguishable from 30+ with over 96 percent accuracy, whereas distinguishing 20s from either of the other two classes is more difficult. Using only function words gives accuracies of 69.4 percent (BMR) and 67.7 percent (WIN), while using just the high information-gain words gives accuracies of 76.2 percent (BMR) and 75.9 percent (WIN). Thus, as we might have expected, topic preference is most related to blogger age, although there is definitely a marked effect on writing style as well.
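The evaluation protocol can be sketched independently of the learner. The example below implements ten-fold cross-validation around a deliberately simple nearest-centroid classifier, which is a stand-in and is neither BMR nor Winnow. The data are synthetic: the class means are the Articles and PersonalPronouns rows of Table 3, while the noise level, sample size, and all function names are assumptions chosen for illustration.

```python
import random

def cross_validate(vectors, labels, train, predict, folds=10, seed=0):
    """Average held-out accuracy over `folds` random, near-equal segments."""
    order = list(range(len(vectors)))
    random.Random(seed).shuffle(order)
    accuracies = []
    for k in range(folds):
        test_idx = order[k::folds]              # every folds-th shuffled index
        held_out = set(test_idx)
        train_idx = [i for i in order if i not in held_out]
        model = train([vectors[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        hits = sum(predict(model, vectors[i]) == labels[i] for i in test_idx)
        accuracies.append(hits / len(test_idx))
    return sum(accuracies) / folds

def train_centroids(vectors, labels):
    """Stand-in learner (NOT BMR or Winnow): one mean vector per class."""
    grouped = {}
    for v, y in zip(vectors, labels):
        grouped.setdefault(y, []).append(v)
    return {y: [sum(col) / len(vs) for col in zip(*vs)]
            for y, vs in grouped.items()}

def predict_centroid(model, v):
    """Return the class whose centroid is nearest in squared Euclidean distance."""
    return min(model, key=lambda y: sum((a - b) ** 2 for a, b in zip(model[y], v)))

# Synthetic documents: (Articles, PersonalPronouns) means from Table 3,
# with invented Gaussian noise (standard deviation 0.1 is an assumption).
means = {"10s": (5.10, 11.72), "20s": (6.46, 10.44), "30+": (6.97, 9.88)}
rng = random.Random(1)
vectors, labels = [], []
for label, (articles, pronouns) in means.items():
    for _ in range(30):
        vectors.append([rng.gauss(articles, 0.1), rng.gauss(pronouns, 0.1)])
        labels.append(label)

accuracy = cross_validate(vectors, labels, train_centroids, predict_centroid)
print(f"ten-fold accuracy: {accuracy:.3f}")
```

Because the synthetic classes are far better separated than real blogs, the accuracy here says nothing about the reported results; the point is only that every document is scored by a model that never saw it during training.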

Genderlinked variation

Regarding blogger gender, we see (Table 3) that Articles and Prepositions are used significantly more by male bloggers, while PersonalPronouns, Conjunctions, and AuxiliaryVerbs are used significantly more by female bloggers. These are the same features that we previously found to indicate male and female writing styles in published fiction and non-fiction works (Argamon, et al., 2003). In content-based features, we see the factors Religion, Politics, Business, and Internet used more frequently by male bloggers, while the factors Conversation, AtHome, Fun, Romance, and Swearing are more often used by female bloggers. (All effects mentioned are statistically significant with p < 0.001.) Prediction of author gender (as above) from function words and the 1000 words with highest information gain for gender gave accuracies of 79.3 percent (BMR) and 80.5 percent (WIN). These results are consistent with classification studies on author gender in other types of texts (Argamon, et al., 2003; de Vel, et al., 2002; Hota, et al., 2006).

It should be noted that style and content effects are highly correlated: use of multiple regressions indicates that controlling for style effects essentially eliminates content effects and vice versa. Thus, it may be that choice of content determines particular style preferences, or both content and style may be influenced by a single underlying variable such as genre preference (Herring, et al., 2004). It is highly probable, though, that a more general sociolinguistic variable underlies this phenomenon, for as we have noted, the results of the current study on gender-linked style are virtually identical to those found in studies of vastly differing genres, including published fiction and non-fiction (Argamon, et al., 2003).

Correlating age and gender

It has not escaped our attention that with few exceptions, the factors and parts of speech that are used significantly more by younger (older) bloggers are also used significantly more by female (male) bloggers. Thus, Articles, Prepositions, Religion, Politics, Business, and Internet are used more by male bloggers as well as older bloggers, while PersonalPronouns, Conjunctions, AuxiliaryVerbs, Conversation, AtHome, Fun, Romance, and Swearing are used more by female bloggers as well as younger bloggers. There are only three exceptions to this pattern: Family, used more by older bloggers and by females; Music, used more by younger bloggers and by males; and School, for which there is no significant difference between male and female usage.
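This pattern can be checked directly against the means in Table 3. The sketch below compares, for each feature named above, the direction of the age difference (30s+ versus 10s) with the direction of the gender difference (male versus female). It works on raw means only and performs no significance test, so it is a consistency check rather than a reproduction of the analysis; the variable and function names are illustrative.

```python
# Mean frequencies copied from Table 3: (10s, 30s+, male, female).
table3 = {
    "Articles":         (5.10, 6.97, 6.46, 5.45),
    "Prepositions":     (11.83, 13.30, 12.76, 12.36),
    "Religion":         (0.44, 0.55, 0.50, 0.46),
    "Politics":         (0.27, 0.56, 0.47, 0.28),
    "Business":         (0.07, 0.16, 0.13, 0.08),
    "Internet":         (0.61, 0.68, 0.74, 0.52),
    "PersonalPronouns": (11.72, 9.88, 9.84, 11.97),
    "Conjunctions":     (2.89, 2.48, 2.63, 2.76),
    "AuxiliaryVerbs":   (9.04, 8.83, 8.76, 9.14),
    "Conversation":     (1.74, 1.33, 1.47, 1.72),
    "AtHome":           (1.11, 0.75, 0.86, 0.98),
    "Fun":              (0.88, 0.28, 0.50, 0.64),
    "Romance":          (0.54, 0.38, 0.39, 0.55),
    "Swearing":         (0.54, 0.25, 0.41, 0.42),
    "Family":           (0.65, 0.94, 0.69, 0.79),
    "Music":            (0.36, 0.26, 0.34, 0.29),
    "School":           (0.35, 0.17, 0.26, 0.25),
}

def direction(a, b):
    """+1 if a > b, -1 if a < b, 0 if equal."""
    return (a > b) - (a < b)

# A feature fits the pattern when "older" lines up with "male"
# (and hence "younger" with "female").
exceptions = [
    name
    for name, (teens, thirties, male, female) in table3.items()
    if direction(thirties, teens) != direction(male, female)
]
print(exceptions)  # ['Family', 'Music', 'School']
```

School appears here only because its raw male mean is marginally higher (.26 versus .25); as noted above, that gender difference is not statistically significant.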

The force of this observation is highlighted when examining those individual words that evince both strong age-linked and gender-linked effects. We consider the 316 words that are among both the 1000 words with highest information gain for age and the 1000 words with highest information gain for gender (as computed on the holdout set). The scatterplot in Figure 1 plots log(w(male)/w(female)) against log(w(30+)/w(10s)), where w(A) is the average frequency of word w in documents of class A. Note that every word but one (husband) lies in the first (male and 30+) or third (female and 10s) quadrants. That is, with just the one exception, every word we considered that is used more by females is used more by younger bloggers and vice versa. The Pearson correlation between the male/female and 30+/10s log-ratios is 0.71.
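The computation behind this comparison can be sketched as follows. The word frequencies below are invented solely to illustrate the arithmetic (they are not corpus values), and the function and variable names are assumptions. A word falls outside the first and third quadrants exactly when its two log-ratios have opposite signs.

```python
from math import log, sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# HYPOTHETICAL average frequencies: word -> (male, female, 30+, 10s).
freq = {
    "political": (0.030, 0.012, 0.040, 0.006),
    "software":  (0.020, 0.006, 0.022, 0.005),
    "giggle":    (0.002, 0.009, 0.001, 0.012),
    "homework":  (0.006, 0.008, 0.001, 0.020),
    "husband":   (0.002, 0.020, 0.025, 0.001),
}

# x = log(w(male)/w(female)), y = log(w(30+)/w(10s)).
points = {
    w: (log(male / female), log(older / younger))
    for w, (male, female, older, younger) in freq.items()
}

# Opposite signs place a word in the second or fourth quadrant.
off_diagonal = [w for w, (x, y) in points.items() if x * y < 0]
print(off_diagonal)  # ['husband']

xs, ys = zip(*points.values())
r_all = pearson(xs, ys)

kept = [p for w, p in points.items() if w not in off_diagonal]
r_without = pearson(*zip(*kept))
print(round(r_all, 2), round(r_without, 2))
```

With only five invented words, a single off-diagonal point lowers the correlation sharply; over the 316 words considered in the text, one such point has little influence.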

Figure 1: Scatterplot showing log(w(male)/w(female)) on the x-axis plotted against log(w(30+)/w(10s)) on the y-axis. Points shown represent the words with highest information gain for both age and gender, as described in the text.

Conclusions

The significance of these results is twofold. First is the fact that, in contradistinction to many previous similar studies, we have analyzed many millions of words of naturally occurring text. This fact lends credence to the conclusion that significant variation in our data reflects real variation in the world (or at least, the world of those likely to write English-language blogs), and is not a mere artifact of our experimental procedure.

Perhaps more significantly, however, our findings serve to link together earlier observations regarding age-linked and gender-linked writing variation that have not previously been connected. Previous studies investigating gender and language have shown gender-linked differences along dimensions of involvedness (Biber, 1995; Argamon, et al., 2003) or contextualization (Heylighen and Dewaele, 2002). Other studies have found age-linked differences in the immediacy and informality of writing (Pennebaker, et al., 2003). The current study suggests that these two sets of results are closely related. Indeed, they likely both reflect a single underlying distinction between inner- and outer-oriented communication that may explain both gender-linked and age-linked variation in language use.

About the authors

Shlomo Argamon is Associate Professor in the Department of Computer Science at the Illinois Institute of Technology in Chicago.
Email: argamon [at] iit [dot] edu

Moshe Koppel is Professor in the Department of Computer Science at Bar-Ilan University (Ramat Gan 52900, Israel).
Email: moishk [at] gmail [dot] com

James W. Pennebaker is Professor and Chair of the Department of Psychology at the University of Texas in Austin.
Email: pennebaker [at] mail [dot] utexas [dot] edu

Notes

3. We must, of course, keep in mind that since this study is synchronic, we cannot separate generational effects from age effects. Moreover, since older bloggers are somewhat less common, they might represent an atypical demographic as early adopters of technology.

4. The informativeness of words for a particular text class (age or gender) was measured by the information gain measure (Quinlan, 1986), an informationtheoretic formula estimating how much information about the class of a text is conveyed by knowing the frequency of a particular word in the text.

5. Tenfold crossvalidation is a standard technique for estimating the generalization accuracy of a machine learning method (see Mitchell, 1997). The data is randomly divided into ten equally sized segments, and the system repeatedly trains on nine of them and tests on the remaining one; the average of these accuracies is the reported result. Thus we avoid testing on examples that were used in training.

C.K. Chung and J.W. Pennebaker, 2007, in press. Revealing people's thinking in natural language: Using an automated meaning extraction method in open-ended self-descriptions, Journal of Research in Personality.

A. Colley and Z. Todd, 2002. Gender-linked differences in the style and content of emails to friends, Journal of Language and Social Psychology, volume 21, number 4, pp. 380-392. http://dx.doi.org/10.1177/026192702237955