How does the front page of the Internet behave? Readability, emoticon use, and links on Reddit

Reddit, known as “the front page of the Internet,” has been one of the most widely visited Web sites since its inception in 2005. As a social networking site it is unique in that the personal relationships between its users are considered secondary to its content, which includes both original, user-generated content and links to outside sources. Although previous research has investigated other social networking platforms in depth, relatively little has been written on Reddit. The present research considers a variety of indicators, including text readability, emoticon usage, and domain linkage. It was found that the most popular communities on Reddit behave very differently from each other, in terms of language sophistication, sentiment, and topicality (as measured by top-level links to outside sources). The results can be used to inform future investigations of online discourse spaces, particularly those in the contemporary social media sphere.

The online community Reddit, known as “the front page of the Internet,” is one of the most popular sites in cyberspace. As of September 2016 it is the ninth most-visited Web site in the United States and the 25th most-visited Web site worldwide (Alexa, 2016). It has been estimated that six percent of Internet-using adults visit the site (Duggan and Smith, 2013). A more recent study claims that “seven percent of U.S. adults report using the site,” with more than three-quarters of these using Reddit as a news source (Barthel, et al., 2016). Reddit, originally founded as a simple link-sharing platform in 2005, now describes itself as “a platform for communities to discuss, connect, and share in an open environment, home to some of the most authentic content anywhere online” (Reddit, 2016). These “communities” are more commonly known as subreddits, each of which has its own unique appearance, guidelines, team of moderators, and, of course, contributors (called “Redditors”). In this paper, subreddits are referred to in the following form: r/subredditName. This is a reference to the URL at which a subreddit can be visited: for example, the “news” subreddit can be viewed at http://reddit.com/r/news/. This also follows a convention on Reddit, wherein references to a subreddit within a comment are expressed in the “r/subredditName” format.

Although Redditors are allowed to “subscribe” to subreddits (which means that threads from these communities show up on the user’s front page), in most instances it is not necessary to subscribe to a subreddit in order to contribute to the discourse. There are a variety of means by which Redditors can participate in a community, including posting comments on a thread, voting on the quality of threads and comments, and posting original threads, though in the last case there are sometimes constraints on what can be posted. For example, r/politics (a community devoted to political matters in the United States) restricts users to posting links to outside articles (this only applies to users seeking to start a thread; user-generated content is encouraged in the comments section for any given thread). However, most subreddits allow users to post original thoughts, ideas, and questions in top-level posts (many, such as r/explainlikeimfive [1] and r/AskReddit [2], are composed entirely of original content).

Though it may be tempting to view Reddit as a monolithic enterprise, the nature of its subreddit system casts it instead as a loosely joined collection of vastly disparate communities. Reddit’s structure allows its users to reject permanent virtual identities (Bergstrom, 2011), and Becker (2013) observed that “each subreddit has its own theme and ‘personality,’ which cater to its online community of readers.” Moreover, subreddits allow for users to tailor the Reddit experience to their own interests (Mills, 2015).

Although previous research has analyzed social interaction across a variety of subreddits (e.g., Choi, et al., 2015), and subreddits have been discussed in a networked context with users as links and the various communities as nodes (Olson and Neal, 2015; see also Olson, 2013, for a visualization of community ties), there is a dearth of research on the actual content of comment threads. Accordingly, this study seeks to address this gap by adopting a content analysis approach in order to analyze readability and, to a lesser extent, user interaction sentiment across different subreddits (the subreddits under analysis were curated from a list of the most popular subreddits in terms of subscribers, as cited by http://redditmetrics.com). Specifically, emoticon analysis (as a measure of sentiment) was used in conjunction with a battery of readability tests (Flesch-Kincaid, Gunning fog, and SMOG). These readability tests take into account average sentence length, syllable count, etc. in order to estimate the grade level (i.e., years of formal education) required to understand a text. Finally, links provided by original posters (that is, the users who create the threads under analysis) were analyzed and classified, in order to determine which types of sites are linked to via the most popular threads on Reddit. This is of particular interest when one considers the “Reddit hug of death,” which occurs when a highly visible thread calls attention to a little-known site without a robust server; the resulting traffic often crashes the site, necessitating the posting of mirrors so that Redditors can view the linked content. A somewhat less dramatic instance of the same phenomenon can be found by looking at Wikipedia page views, wherein pages related to a popular Reddit thread often experience temporary increases in viewcounts (Moyer, et al., 2015).

In sum, the research was guided by the following questions:

RQ1: To what degree do different subreddits (and subreddit categories) differ in terms of readability?

RQ2: To what degree do different subreddits (and subreddit categories) use emoticons (both in terms of frequency and in terms of sentiment)?

RQ3: What types of sites do different subreddits (and subreddit categories) link to, taking into consideration only links that are found in original posts (OPs)?

Related work

The primary metric of the popularity of any given thread or comment on Reddit is its score. This may be compared to Facebook’s “like” system or Twitter’s retweets, although Reddit allows users to downvote items (expressing dislike). Accordingly, assessing the popularity of a thread or comment is not quite as straightforward on Reddit as it is on other social networking sites, a state of affairs that results in Reddit offering several methods of “sorting” comments within a thread or threads within a subreddit. These different methods take into account variables such as time (“hot”), raw upvote scores (“top”), and upvote/downvote ratios (“best”).

Reddit’s voting system has been criticized, with Gilbert (2013) claiming that it does not consistently identify “potentially popular links” [3], thus defeating the purpose of the site. Another criticism is that it leads to “Karma whoring” [4], wherein “users address the lowest common denominator and usually extend already popular topics” (Richterich, 2014) in the hopes of obtaining upvotes and thus a higher karma score. However, it has also been found that Reddit’s voting system was effective at identifying high-quality “quotable phrases” (Bendersky and Smith, 2012). In addition, Turcotte, et al. (2015) have observed that “social media recommendations improve levels of media trust, and also make people want to follow more news from that particular media outlet in the future,” which indicates that it is likely that Reddit’s content delivery system has an effect on frequent visitors. Although it has been noted that “consumption of news from information/news Web sites is positively associated with higher trust, while access to information available on social media is linked with lower trust” (Ceron, 2015), the fact that Reddit by its very nature actually links to outside information sources suggests that it captures the best of both worlds: the experience of social media coupled with the authority granted by “official” news sites (see also Johnson and Kaye [2014], wherein it was found that “reliance on other online sources is linked to perceptions of high credibility of SNS”).

Readability scores

Three of the most commonly used readability tests are Flesch-Kincaid, Gunning fog, and SMOG. All of these tests return an estimated grade level — that is, the amount of formal education required to comprehend any given text. Whereas the Flesch-Kincaid formula (hereafter referred to simply as “Flesch”) takes into account the total number of syllables in a text (Kincaid, et al., 1975), the Gunning fog index only considers the total number of “complex words” (defined as those with at least three syllables) (Bogert, 1985). Both tests take into account the total number of words and sentences in the text under analysis. The SMOG index, consciously designed as a simpler yet more accurate version of Gunning fog, only considers the total number of sentences and the total number of “complex words” (Fitzsimmons, et al., 2010).
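For reference, the three grade-level estimates can be sketched from raw counts as follows. The constants are those of the standard published formulas; the function names and the example counts are hypothetical and are not drawn from the corpus analyzed here. Counting syllables and sentence boundaries is a separate problem that this sketch does not address.

```python
import math


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid grade level: uses words, sentences, and total syllables."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def gunning_fog(words: int, sentences: int, complex_words: int) -> float:
    """Gunning fog index: uses words, sentences, and words of 3+ syllables."""
    return 0.4 * ((words / sentences) + 100 * (complex_words / words))


def smog(sentences: int, complex_words: int) -> float:
    """SMOG grade: uses only sentences and words of 3+ syllables."""
    return 1.0430 * math.sqrt(complex_words * (30 / sentences)) + 3.1291


# Hypothetical text: 100 words, 5 sentences, 150 syllables, 10 complex words.
fk = flesch_kincaid_grade(100, 5, 150)
fog = gunning_fog(100, 5, 10)
smog_grade = smog(5, 10)
```

Because the three formulas weight sentence length and word complexity differently, they can disagree by a grade level or more on the same text, which is one motivation for averaging them.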

Readability scores have been used across a variety of contexts in order to ascertain the “difficulty” of reading a document, including articles published in medical journals (Weeks and Wallace, 2002), business reports (Clatworthy and Jones, 2001; Jones, 1996; Smith and Taffler, 1992), mission statements (Busch and Folaron, 2005), and college textbooks (McConnell, 1982). Flesch scores have been used to detect deceptive language (Burgoon, et al., 2003) and analyze the degree of community present in a classroom setting (Rovai, 2002). In computer-mediated contexts, Flesch scores have been used to analyze online news articles (Knobloch-Westerwick and Johnson, 2014) and medical Web sites (Whitten, et al., 2008). All three scores were used to determine that a job analysis questionnaire was considered to be at the college-level in terms of readability (Ash and Edgell, 1975).

Perhaps the most frequent usage of readability tests in the academic literature is in regard to health care materials. The Gunning fog metric has been used to establish that health care literature provided to patients was too advanced for the intended audience (Gazmararian, et al., 1999), just as the same index (taken in conjunction with Flesch) indicated that consent forms regarding oncology protocols were also too advanced for lay patients (Grossman, et al., 1994). Both SMOG and Flesch were used by Beaver and Luker (1997) to determine that breast cancer booklets distributed in the U.K. were too advanced for most readers, while SMOG on its own has been used to argue that materials relating to other health-care topics are too advanced for the general public, including HIV/AIDS (Wells, 1994), strokes (Sullivan and O’Conor, 2001), and dentistry (Jayaratne, et al., 2013). Finally, SMOG scores have been used to argue that many online materials relating to health are written above the average reader’s comprehension level (Aliu and Chung, 2010).

In the field of scholarly communication, Gunning fog was used to discover that the peer-review process improves readability, though the end results are still too advanced for many readers (Roberts, et al., 1994). It may be that editors and reviewers either consciously or unconsciously shy away from simplifying submissions too much, as more “difficult” texts are seen as containing better research (Armstrong, 1980).

Applying these scores to a computer-mediated environment is somewhat more difficult, as noted by Sallis and Kassabova (2000), although with proper adjustments (e.g., filtering out unparseable elements such as URLs) they can be quite powerful. The SMOG index in particular has been found to be an accurate measure of the readability of online documents (Gottron and Martin, 2009). In analyzing online messages, Walther (2007) used the Flesch index to determine that writers used more complex language when they believed they were speaking with a university professor, while using less sophisticated language when they were under the impression that they were speaking with a high schooler, indicating a degree of “language accommodation” in this context. Flesch scores have also been used to analyze e-mail correspondence in newsgroups (Sallis and Kassabova, 2000). Finally, Flesch was used as a means of ensuring that texts were written at a fifth-grade reading level (that is, the texts should be readable by a 10–11 year old), in order to facilitate the demands of a usability study concerning a computer-mediated communication health application (Lin, et al., 2009).

Given that there is something of a pervasive trend of “official” materials being too advanced for the general public, one would expect content generated by lay people, such as that found on Reddit, to be far more comprehensible in terms of language sophistication. The question, then, is how much “simpler” the discourse on Reddit is, and how it varies across communities and topic types. A cursory observation of the site’s most popular subreddits (even in pure terms of simple comment length) indicates that comments in “entertainment”-based subreddits seem to be relatively short, and comment chains extend deeper into the tree most often when there is a recurring joke that multiple members are exploiting. Conversely, subreddits about “serious” topics (e.g., those classified as “news,” “politics/history,” and “science”) tend to encourage discussion (as opposed to mere reactions to an outside link or original story), which carries with it a greater attention to detail and information-seeking, which in turn would be expected to lead to a higher level of discourse in the comments section. Hence the following hypothesis:

H1: “Entertainment” subreddits will be rated as more accessible via readability tests when compared to “serious” subreddits.

All of these can be found in the Reddit environment. Although the impacts of emoticons on readers have been claimed to be minimal (Walther and D’Addario, 2001), emoticons do indicate intended sentiment, and indeed, Derks, et al. (2007a) found that emoticons “have an impact on message interpretation” and “are useful in strengthening the intensity of a verbal message.” Emoticons have also been used to aid machine learning efforts (Read, 2005). From another perspective, emoticons serve a useful purpose in professional e-mail messages (Skovholt, et al., 2014). Along the same lines, emoticons allow for non-verbal information to be communicated in a computer-mediated context (Lo, 2008), just as they allow users to clarify the intended meanings of their missives (Thompson and Filik, 2016). There does seem to be something of a generation gap in terms of using emoticons as non-verbal indicators (Krohn, 2004), but this is somewhat less relevant in the Reddit environment, given that Reddit users tend to be younger than Internet users as a whole (Barthel, et al., 2016). Finally, previous research has found that people “used more emoticons in socio-emotional than in task-oriented social contexts” [6], which leads to the second hypothesis:

H2: “Entertainment” focused subreddits will be more likely to use emoticons, while more “serious” subreddits will be less likely to use emoticons.

The results of this research can be used to gain insights into the manner in which Reddit as a whole operates, as well as to inform future research that seeks to analyze emoticon use, topicality, and readability in online communities.

Methods

A total of 204 subreddits were chosen for analysis; these were the subreddits with the largest number of subscribers in February 2016. Two separate samples, each consisting of the top 200 subreddits, were gathered during this time; as there were slight differences between the two lists, the total number of subreddits adds up to 204. These subreddits were identified via the “reddit metrics” site, which provides a wealth of information about Reddit, including the “fastest growing” subreddits, the “top new reddits,” and — most relevantly for the current research — a ranked list of subreddits (“Top subreddits,” 2016). All popularity metrics take the number of subscribers into account.

From each of the 204 subreddits, the top 75 threads of all time (based on Reddit’s “top” sorting feature) were chosen, and up to 500 comments were selected from each thread (when a thread contained more than 500 comments, the Reddit sorting algorithm was deferred to — that is, the 500 “best” comments were selected). In addition, the “top” threads in a 24-hour period were sampled from each subreddit (up to a maximum of 25 threads per subreddit), and on two separate occasions the “hot” threads in a 24-hour period were also sampled from each subreddit (again, for each of these samples, up to a maximum of 25 threads per subreddit were harvested) [7]. In short, the final sample consisted of the top all-time threads from the most popular subreddits, as well as a series of “snapshots” of these subreddits. Thus, this research is not intended to be a description of Reddit as a whole; rather, it seeks to consider what the most popular threads on the most popular subreddits are discussing, and the level of discourse evidenced in the most visible sections of the community.

Each of the subreddits was manually classified into one of 27 exclusive topical categories. While it is true that the comment trees in any given subreddit often drift away from the topic of the original post, the different subreddit categories effectively establish something of a baseline prompt, which makes the topical groupings of the subreddits meaningful. The classifications provided in the ModeratorDuck subreddit were used as a starting point (“Categorization of all subreddits,” 2014), although many subreddits were not included in this list, and several amendments needed to be made. A list of these categories, along with brief descriptions and member subreddits, can be found in Table A in the Appendix.

The study consists of three distinct sections: readability scores, sentiment (as measured by emoticon usage), and domain analyses. Each of these will be discussed in turn.

Readability

The Text-Statistics PHP library was used to facilitate large-scale computations of readability across the corpus (Childs, 2016). Individual scores were calculated for each comment, which were then averaged together to calculate the mean scores for each readability test for each subreddit. Finally, the mean scores of the three different readability tests (Flesch, Gunning fog, and SMOG) were averaged together in order to determine a mean readability score for each subreddit. These scores are expressed as a number that indicates the estimated grade level of education required to understand a text; thus, a score of 8 would indicate that a text is written at an eighth-grade reading level (that is, a 13-year-old student should be able to read the text without any significant difficulties).
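The two-stage averaging described above can be illustrated with a minimal sketch. It is written in Python rather than with the PHP library used in the study, and the function name, the dictionary keys, and the example scores are all hypothetical.

```python
from statistics import mean

TESTS = ("flesch", "fog", "smog")  # hypothetical keys for the three tests


def subreddit_readability(comment_scores):
    """Average per-comment grade levels into one score for a subreddit.

    comment_scores: one dict per comment, mapping each test name to the
    grade level that test assigned to the comment.
    Returns (mean score per test, mean of those three means).
    """
    # Stage 1: mean score for each readability test across all comments.
    per_test = {t: mean([c[t] for c in comment_scores]) for t in TESTS}
    # Stage 2: mean of the three per-test means.
    overall = mean(list(per_test.values()))
    return per_test, overall


# Two made-up comments from a single subreddit.
example = [
    {"flesch": 6.0, "fog": 8.0, "smog": 7.0},
    {"flesch": 4.0, "fog": 6.0, "smog": 5.0},
]
per_test, overall = subreddit_readability(example)
```

Averaging the per-test means, rather than pooling every individual score, gives each test equal weight even when some comments are excluded from one test but not the others.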

Although the Text-Statistics library provided two further readability tests (Coleman-Liau and Automated Readability Index), these were found to be unsatisfactory for computer-mediated environments (particularly when a subreddit contained a high proportion of posts along the lines of “HAHAHAHAHAHAHAHAHAHAHAHAHA”), and thus these scores were not taken into consideration for this project (though it should be noted that the Automated Readability Index has been used in relation to product reviews, wherein language is somewhat more regulated, e.g., Hu, et al., 2012). In addition, any individual comments that were outside the normal Flesch range (0–100) were excluded from analysis for this particular test. A full list of results can be found in Table B in the Appendix.

Sentiment [Emoticons]

The emoticon list was drawn from the EmoticonLookupTable.txt file included in the SentiStrength download (see Thelwall, et al., 2010; Thelwall, et al., 2012). This file includes a list of emoticons mapped to their perceived sentiment (e.g., a smiley face — :) — is given a score of “1,” whereas a frowny face — :( — is given a score of “-1”). However, this list was slightly adapted in order to facilitate analysis. Specifically, the “:/” emoticon was removed from the list, as it generated a large number of false positives due to the presence of URLs in the comments (http:// being the most common violator). Emoji and other non-textual emoticons were not considered, as the Reddit platform only permits textual characters in the comments section.

Two separate analyses were carried out: the percentage of comments per subreddit that used at least one emoticon, and the average sentiment of all emoticons used across a given subreddit’s sampled comments. The results can be found in Table C in the Appendix.
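Both emoticon measures can be sketched as follows. The lookup table below is a small hypothetical subset and not the contents of the SentiStrength file. Simple substring matching and averaging over every emoticon occurrence are assumptions made for illustration; the exact matching procedure is not specified above.

```python
# Hypothetical subset of an emoticon-to-sentiment table. ":/" is deliberately
# absent, mirroring its removal to avoid false positives from "http://".
EMOTICON_SCORES = {":)": 1, ":-)": 1, ":D": 1, ":(": -1, ":-(": -1}


def emoticon_scores_in(comment):
    """Return one sentiment score per emoticon occurrence in the comment."""
    found = []
    for emoticon, score in EMOTICON_SCORES.items():
        found.extend([score] * comment.count(emoticon))
    return found


def emoticon_stats(comments):
    """Return (percent of comments with an emoticon, mean emoticon score)."""
    per_comment = [emoticon_scores_in(c) for c in comments]
    with_emoticon = sum(1 for scores in per_comment if scores)
    all_scores = [s for scores in per_comment for s in scores]
    percentage = 100 * with_emoticon / len(comments) if comments else 0.0
    mean_score = sum(all_scores) / len(all_scores) if all_scores else None
    return percentage, mean_score


sample = ["great :)", "see http://example.com", "ugh :(", "nice :D"]
percentage, mean_score = emoticon_stats(sample)
```

In this made-up sample the URL contributes no emoticon, so three of the four comments count toward the percentage and the mean score is positive.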

Domain analysis

Whereas some subreddits prohibit the posting of links in top-level posts (often because the nature of the subreddit, such as r/explainlikeimfive and r/showerthoughts, stipulates that top-level posts should only consist of text, often in the form of a question or statement), others prohibit the posting of anything but links (for example, r/politics, wherein OPs must link to an outside source, and the title of the post must be drawn from said source). Keeping this in mind, it is instructive to consider all OP links across the entire sample, as these represent the outside domains that were linked to most often in the most popular posts in the most popular subreddits (obviously, subreddits that prohibit outside links in OPs are not represented in this analysis). A list of Web sites that were linked to at least 10 times across the entire sample can be found in Table D in the Appendix. These Web sites (n = 21,797) represent 81.1 percent of the links that could be found in the OPs across the sample.
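The domain tally can be sketched as follows. The function names and example URLs are hypothetical, and stripping a leading "www." while otherwise keeping subdomains is an assumption, chosen because hosts such as en.wikipedia.org are reported with their subdomain.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_of(url):
    """Return the lower-cased host of a URL, without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def frequent_domains(op_links, min_count=10):
    """Count linked domains and keep those linked at least min_count times.

    Returns (kept domain counts, percent of all links that they cover).
    """
    counts = Counter(domain_of(u) for u in op_links)
    kept = {d: n for d, n in counts.items() if n >= min_count}
    coverage = 100 * sum(kept.values()) / len(op_links) if op_links else 0.0
    return kept, coverage


# Made-up sample: two frequently linked domains and one rare one.
links = (
    ["http://imgur.com/a"] * 12
    + ["https://www.youtube.com/watch?v=x"] * 10
    + ["http://example.org/p"] * 3
)
kept, coverage = frequent_domains(links)
```

The coverage figure plays the same role as the percentage reported above: the share of all OP links that point to domains meeting the threshold.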

The various Web sites were classified into one of five categories. A list of these categories, along with example sites and brief descriptions, can be found in Table 1 (the precise categories can be found in Table D in the Appendix).

Table 1: Site classifications.

| Classification | Explanation | Examples |
| --- | --- | --- |
| GIFs | Sites that host silent videos, animations, clips, etc. | gfycat.com; tumblr.com |
| Images | Sites that host static images | imgur.com; instagram.com |
| News | Sites that provide current information | bbc.co.uk; huffingtonpost.com |
| User-generated content | Sites that rely on original user-generated content | en.wikipedia.org; twitter.com |
| Videos | Sites that host videos with sound | vimeo.com; youtube.com |

Results

For each of the dependent variables, a one-way ANOVA was calculated to predict the dependent variable based on the subreddit category variable. A significant finding indicates that the dependent variable is influenced by the subreddit category. All of these relationships were found to be significant at p < .0001, and all had a moderate effect size (η² > .25), per Ferguson (2009) (Table 2). Accordingly, we can say that a subreddit’s category has a predictive effect on all of the variables under analysis: the average readability score of a subreddit is dependent on the subreddit’s category, the category of a subreddit is a reliable predictor of emoticon usage within the subreddit, etc.
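As a minimal sketch of the statistics involved, a one-way ANOVA and its η² can be computed from grouped values as shown below, and η² can also be recovered from a reported F statistic and its degrees of freedom. The function names and the toy groups are hypothetical; the study's own analysis software is not specified.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within, eta_squared) for numeric groups."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(g) / len(g) for g in groups]
    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of individual values around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    eta_sq = ss_between / (ss_between + ss_within)
    return f_stat, df_between, df_within, eta_sq


def eta_squared_from_f(f_stat, df_between, df_within):
    """Recover eta squared from F and its degrees of freedom."""
    return (f_stat * df_between) / (f_stat * df_between + df_within)


# Toy data: three groups of three values each.
f_stat, df_b, df_w, eta_sq = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
```

With 27 categories and 204 subreddits the degrees of freedom are 26 and 177, and applying `eta_squared_from_f` to the reported F(26, 177) = 4.375 gives approximately 0.391, matching the emoticon score row of Table 2.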

Table 2: ANOVA results for dependent variables.

| Dependent variable | F-statistic | η² |
| --- | --- | --- |
| Emoticon score | F(26, 177) = 4.375 | 0.391 |
| Emoticon percentage | F(26, 177) = 4.586 | 0.403 |
| GIFs | F(26, 177) = 3.562 | 0.344 |
| Images | F(26, 177) = 9.042 | 0.57 |
| News | F(26, 177) = 7.836 | 0.535 |
| Readability scores (mean) | F(26, 177) = 9.901 | 0.593 |
| Videos | F(26, 177) = 9.958 | 0.594 |
| User-generated content | F(26, 177) = 3.66 | 0.35 |

Readability

The mean readability scores across the 27 subreddit categories ranged from 4.6 (indicating a fourth- to fifth-grade reading level) to 7.8 (indicating a seventh- to eighth-grade reading level). Subreddits that were classified as “porn,” “GIFs,” “videos,” and “images” (which, taken together, might be considered a “multimedia” macro category) were found at the lower end of the spectrum, while subreddits classified as “philosophy/religion,” “business/finance,” and “science” (all of which could be considered more “academic,” or at least more likely to spark intricate discourse) were found at the upper end of the spectrum. The full results can be seen in Figure 1 (the y-axis numbers have been selected to emphasize the distinctions between the subreddit categories).

Figure 1: Mean readability scores by subreddit type.

In terms of the Tukey tests, the subreddits classified as “porn” and “videos” (the latter not containing any pornographic material) were consistently rated as having a less-sophisticated discourse style than other subreddits, particularly compared to those categorized as “philosophy/religion” and, to a lesser degree, “business/finance.” In addition, “sports” subreddits tended to rank on the lower end of the readability spectrum. This suggests that the “philosophy/religion” and “business/finance” subreddits contain in-depth discourse (perhaps with the usage of terms that are sufficiently “sophisticated” to register highly on the various readability tests), along with relatively intricate sentence construction. Conversely, the “porn,” “videos,” and “sports” subreddits tend towards comments that consist of simple language with little technical jargon. Moreover, it appears that more actual “discussion” goes on in subreddits that ranked higher on the readability tests (as the resulting back-and-forth between members engenders increasingly sophisticated discourse), whereas the lower-ranked subreddits consist more of simple opinions or arguments (“You’re wrong,” etc.).

Sentiment [Emoticons]

Subreddits classified as “sports,” “random/assorted,” and “humor” had the lowest mean emoticon score (indicating that they had a tendency to use negative emoticons, a tendency to avoid positive emoticons, or a combination of both), while subreddits classified as “relationships” and “health/food” had the highest mean emoticon scores (Figure 2). In terms of emoticon use frequency, subreddits classified as “politics/history” and “news” were the least likely to contain comments that used emoticons, while subreddits classified as “health/food” were the most likely to contain comments that used emoticons (Figure 3).

Figure 2: Mean emoticon scores by subreddit type.

Figure 3: Mean emoticon percentage by subreddit type.

Domain analysis

The types of sites that were linked to by OPs were heavily dependent on the subreddit classification. “Porn” subreddits consistently linked to “GIF” sites at a much higher rate than other subreddit types. Unsurprisingly, “images” subreddits, along with “GIFs,” “photography,” and “porn” subreddits, were most likely to link to “images” sites. Similarly, “news” and “business/finance” subreddits were most likely to link to “news” sites (as were “politics/history” subreddits, albeit to a somewhat lesser degree). Finally, the “meta” and “sports” subreddits were the most likely to link to sites containing “user-generated content.”

Discussion

The various subreddits exhibited a number of differences in posting style and content. It was perhaps the readability analysis that exhibited the starkest differences, as the topical focus of a subreddit was a reliable predictor of the complexity of its discourse. Subreddits that aim at answering questions or encouraging discussion (e.g., r/science, r/philosophy, r/AskHistorians) possessed the most linguistically advanced discourses, whereas subreddits such as r/ass, r/milf, r/gonewild, and r/Amateur (the “porn” subreddits) were consistently ranked at the bottom (the conversations on the latter subreddits were generally conducted at no higher than a fourth grade reading level). The latter is hardly surprising, as the words used most often across subreddits such as r/RealGirls (with stopwords ignored) consisted almost exclusively of expletives, obscenities, and terms such as “love,” “yeah,” “fake,” “face,” and “hot.” Conversely, the subreddits that ranked highest in terms of linguistic sophistication used words such as “moral,” “human,” “access,” “articles,” “research,” “science,” “question,” and “answer,” indicating concepts that lend themselves to a deeper level of discourse (as well as an environment in which a question/answer dynamic is frequent, suggesting that requests for more information may lead to more erudite discussions).

Four subreddits — r/AskHistorians, r/philosophy, r/askscience, and r/changemyview — averaged an eighth grade reading level, which was the highest average level observed across the sample, with the exception of two notable outliers: r/rickandmorty and r/circlejerk. These subreddits scored abnormally highly on at least one readability test, illustrating the imperfections inherent in applying these tools to a computer-mediated context. The latter subreddit is easy to explain, as a single post that scores abnormally highly on one or more of the readability tests (e.g., “hahahahahahahahahaha”) will often be copied by many subsequent users, many of whom may add their own variations, most of which will fall outside of the “normal” realm of discourse expected by the readability tests. This same effect can be seen in r/rickandmorty, wherein the results were heavily skewed by the presence of one thread wherein more than 60 posts simply consisted of a long string of capital Hs. When these results were removed, the subreddit ranked near the bottom in terms of linguistic sophistication. As a final note, it is worth mentioning that r/DepthHub, which by its own description “gathers the best in-depth submissions and discussion on Reddit,” ranked highly in the readability tests, simultaneously validating the success of this subreddit and the applicability of selected readability tests for a computer-mediated environment.

The emoticon analysis was not quite as revealing (nor were the subreddits at either end of the spectrum as easily classified), but there were still some interesting findings. Of the six subreddits with a negative score (indicating a greater proportion of emoticons classified as negative by SentiStrength’s dictionary), two are sports-related (r/nba and r/nfl), possibly because sports discussions tend to involve negativity towards players, teams, etc. However, it is important to emphasize that, on the whole, “sports” subreddits still had a positive emoticon score, although this was the lowest score witnessed across the 27 subreddit categories.

In terms of raw emoticon counts (not taking positivity/negativity into account), subreddits classified as “health/food” tended to contain more emoticons, whereas subreddits classified as “politics/history” and “news” tended to contain fewer emoticons. This appears to be due to the fact that many “health/food” subreddits involve dieting, wherein emoticons may be used as encouragement (or may be used as reflections of a poster’s individual experiences). Specifically, the two subreddits with the highest percentage of comments containing emoticons — r/loseit (10.43 percent) and r/MakeupAddiction (13.18 percent) — are both lifestyle subreddits wherein motivational statements are highly valued. Subreddits such as r/SkincareAddiction and r/bodyweightfitness are not much further down the list. It is also important to note that these “health/food” subreddits ranked highest in terms of emoticon sentiment, further lending credence to the idea that contributors to these subreddits use positive emoticons as a means of encouraging others and solidifying the community.

Conversely, subreddits such as r/liberal and r/conservative (two of the three subreddits with the lowest percentage of comments containing emoticons, with 1.11 percent and 1.26 percent, respectively), r/news (1.35 percent), and r/politics (1.61 percent) seem to shy away from emoticon use, possibly because factual discussion is prioritized in these communities over personal opinions (or, alternately, because proffered opinions are expected to be “straight,” without any emoticon embellishments or other niceties). Interestingly, “humor” subreddits rank in the lower third of subreddit categories in terms of emoticon usage, suggesting that it may be considered somewhat gauche to use emoticons in these subreddits. A possible explanation is that, in these subreddits, images, videos, GIFs, and plain text are used for humorous effect, and thus emoticons would be the Internet equivalent of a laugh track — leading, distracting, and frowned upon by many in the community. Yet another possible explanation is that the inclusion of r/4chan in the “humor” category may have contributed greatly to this score, considering that 4chan is associated with a rather venomous type of humor.

Finally, with regard to the domain analysis, the vast majority of sites linked to by the top Reddit threads are either mainstream news organizations (e.g., Guardian, New York Times) or social media (e.g., YouTube, Twitter, Imgur). This simultaneously supports and argues against Reddit’s claim to being “the front page of the Internet”: whereas the most popular threads on the most popular subreddits clearly link to popular sites, it is also true that the front page of Reddit is not necessarily the best place to seek out information that is available only in less widely known venues. The most visible segments of Reddit, then, appear to reflect the most visible segments of the Internet, from Wikipedia to Imgur/YouTube to English-speaking news sites (both in the U.K. and in the U.S., which is hardly surprising for an English-based Web site). Of course, the long tail evidenced in regard to linked domains (accounting for 19.1 percent of domains) indicates that Reddit indeed does manage to highlight lesser-known venues, although these venues are (rather predictably) not as prominent as more mainstream sources.
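
A domain tally of this kind can be sketched roughly as follows. The sketch assumes a list of link URLs and a hypothetical cutoff `top_n` separating "head" domains from the long tail, and it measures the tail as a share of links; the study's own cutoff and exact definition of the long tail are not stated in this section, so these choices are assumptions.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_of(url):
    """Extract the host from a URL, dropping a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def domain_counts(urls):
    """Count how many links point to each domain."""
    return Counter(domain_of(url) for url in urls)


def long_tail_share(counts, top_n):
    """Percentage of links pointing outside the top_n most-linked domains.

    top_n is a hypothetical cutoff chosen for illustration.
    """
    total = sum(counts.values())
    if total == 0:
        return 0.0
    head = sum(n for _, n in counts.most_common(top_n))
    return 100.0 * (total - head) / total
```

Normalizing away the `www.` prefix keeps links such as `www.youtube.com` and `youtube.com` in the same bucket; mobile or shortened hosts would need additional handling that is not attempted here.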

Conclusion

The topical category of any given subreddit is a reliable predictor of the content within the subreddit. While some consequences are expected (e.g., certain subreddits prohibit links in OPs, whereas others require a link to a site such as Imgur; in these situations, it is not surprising that there are systematic differences in OP link domains), others are more surprising. Different subreddits exhibit vastly varying levels of discourse sentiment and sophistication, with “porn” subreddits using more basic vocabularies (posts often consist simply of crass statements such as “hot girl”) and “philosophy/religion” subreddits using the most sophisticated vocabularies. The “health/food” subreddits simultaneously use emoticons most frequently and use them the most positively, indicating that encouraging and motivating others via emoticons is an integral part of being a member of these communities.

Future research in this area could undertake a more comprehensive sentiment analysis based on actual language patterns, although this would most likely need to be conducted manually. Similarly, a robust topical analysis of OPs and their associated comments would go a long way towards determining what, precisely, people are talking about on Reddit. Finally, this study only considered the most popular comments on the most popular threads in the most popular subreddits, because the aim was to analyze what the average Reddit user sees. It would be interesting to see if this study’s findings hold up across the whole of Reddit. What is certain, however, is that differences will continue to be found, as Reddit is not a monolithic enterprise. Rather, it is effectively a collection of very different communities, with different participants, different goals, and different norms, which share a common platform but very little else. This may well be the reason for its continued popularity; given that users can locate, join, and even create communities with ease, and given the wide variation in topicality, language use, and user interaction across the site, a large variety of audiences can find much of worth on “the front page of the Internet.”

About the author

Andrew Tsou is a Ph.D. student in the Department of Information & Library Science at Indiana University Bloomington. His research interests include computer-mediated communication and discourse patterns across social media platforms.
E-mail: iatsou [at] umail [dot] iu [dot] edu

Acknowledgements

The author would like to thank Patrick Shih for his assistance in preparing this manuscript.

Notes

1. Often referred to as “ELI5” for short, this subreddit allows users to ask questions about a variety of topics, with the understanding that responses should be written as simply as possible (despite the subreddit’s name, responses are not expected to be comprehensible by actual five-year-olds). Analogies are often used to make complicated points.

2. “AskReddit” is somewhat more informal than “explainlikeimfive,” in that many questions involve asking the Reddit community about their opinions/suggestions on a variety of topics.

References

Michael Bendersky and David A. Smith, 2012. “A dictionary of wisdom and wit: Learning to extract quotable phrases,” Proceedings of the Workshop on Computational Linguistics for Literature, co-located with the 2012 Conference of the North American Chapter of the Association for Computational Linguistics, pp. 69–77; version at http://bendersky.github.io/pubs/2012-3.pdf, accessed 25 October 2016.

John C. Roberts, Robert H. Fletcher, and Suzanne W. Fletcher, 1994. “Effects of peer review and editing on the readability of articles published in Annals of Internal Medicine,” Journal of the American Medical Association, volume 272, number 2, pp. 119–121.

William B. Weeks and Amy E. Wallace, 2002. “Readability of British and American medical prose at the start of the 21st century,” British Medical Journal, volume 325, number 7378, pp. 1,451–1,452.
doi: http://dx.doi.org/10.1136/bmj.325.7378.1451, accessed 25 October 2016.

James A. Wells, 1994. “Readability of HIV/AIDS educational materials: The role of the medium of communication, target audience, and producer characteristics,” Patient Education and Counseling, volume 24, number 3, pp. 249–259.
doi: http://dx.doi.org/10.1016/0738-3991(94)90068-X, accessed 25 October 2016.

How does the front page of the Internet behave? Readability, emoticon use, and links on Reddit
by Andrew Tsou.
First Monday, Volume 21, Number 11 - 7 November 2016
https://www.firstmonday.dk/ojs/index.php/fm/article/view/7013/5651
doi: http://dx.doi.org/10.5210/fm.v21i11.7013