A few months back I started a crowdsourced list of social media data collection tools. A few folks tweeted it today, which reminded me that I never posted it to my website. At the time I had visions of lovingly curating it and maybe even making it searchable based on one’s specific research criteria and programming abilities… but the standard tenure-track priorities quickly caught up with me. Still, some folks seem to find it useful, so here it is:

Big Data. Computational social science. Data science. Analytics. These buzzwords are everywhere these days—business, government, the nonprofit sector, you name it—and the social sciences are no different. The question of what to do about the explosion of datasets and data sources far larger than the local norm has moved to the center of many disciplines. Recently in my own field of communication, we’ve seen special issues of several journals devoted to computational social science (my term of choice; hereafter CSS), a growing number of faculty openings, and enough panels at various annual meetings to fill out a mid-size conference unto itself.

But it’ll take more than this to do CSS effectively, and by “do” I mean “conduct research and teach at a world-class level.” These are goals that can, and should, be implemented at the departmental level—just as some communication departments are known for their expertise in survey or rhetorical methods, an enterprising upstart could become the first to gain fame for excellence in CSS. Such a department would need to build strength in at least four distinct areas: faculty, curriculum, hardware, and data. A few departments have addressed one or two of these at some level, but I don’t know of a single one (at least in the US) that shines in all four. It can’t all be done cheaply, but if you believe as I do that CSS looms large in comm’s future, it’ll be well worth it.

But I don’t expect you to take me at my word, so before I go into detail on the four areas, I’d like to justify the enterprise a bit. Lots of strong rationales come to mind but here are three of the most important:

There are some kinds of analysis you can’t do any other way.

Computational skills open an entirely new dimension of empirical possibilities to their practitioners. This dimension holds the potential to radically transform every step of the research process—data acquisition, preprocessing, analysis, visualization, and interpretation. Here I’ll offer two specific examples demonstrating different aspects of this general point.

First, CSS practitioners often denigrate the act of preprocessing raw data into more manipulable forms as “data janitorial” work. This metaphor is extremely misleading: preprocessing determines which analytical methods can be applied to a given dataset, and therefore an expert “data janitor” has many more analytical options at her disposal than one who lacks such skills. For example, one of the first steps in network analysis of Twitter data is converting tweet text into formats suitable for network analysis. NodeXL, which offers perhaps the most user-friendly means of doing so, automatically creates network edges between tweet authors and any usernames included in their tweets. The program can distinguish between “replies” (created by using Twitter’s “reply-to” function) and “mentions” (created by simply including another user’s name in a tweet), but not retweets, modified tweets, CCs, or other referential conventions. The ability to make such distinctions is important given research that shows meaningful differences in how these conventions are used (e.g. Conover, 2011). I don’t know of any off-the-shelf software that can do this, but it’s a trivial task in most programming languages. The broader point is that relying on off-the-shelf software tends to sharply limit researchers’ data manipulation options.
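To make the “trivial task” claim concrete, here is a minimal Python sketch of the kind of distinction off-the-shelf tools miss. The regex patterns for retweets, modified tweets, and CCs are my own rough approximations of these conventions, not canonical definitions:

```python
import re

# Rough approximations of Twitter referential conventions -- hypothetical
# patterns for illustration, not canonical definitions.
def classify_tweet(text):
    if re.match(r"RT @\w+", text):
        return "retweet"
    if re.match(r"MT @\w+", text):
        return "modified tweet"
    if re.match(r"@\w+", text):
        return "reply"
    if re.search(r"\bcc:? @\w+", text, re.IGNORECASE):
        return "cc"
    if re.search(r"@\w+", text):
        return "mention"
    return "none"

def extract_edges(author, text):
    """One (author, target, type) network edge per @username in the tweet."""
    ref_type = classify_tweet(text)
    return [(author, name, ref_type) for name in re.findall(r"@(\w+)", text)]
```

A dozen lines of pattern matching recover distinctions that point-and-click tools discard, which is exactly the analytical flexibility at stake.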

The second point can be explained very briefly. Many of the most powerful tools for analyzing digital data are modules or libraries for use within different programming environments. A few Python libraries comm researchers might find useful include pandas, scikit-learn, statsmodels, NetworkX, and my own TSM. But working knowledge of the language is a prerequisite for their use.
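As a quick illustration of what that working knowledge buys you, here is a toy example combining two of the libraries named above (the edge list is invented for demonstration):

```python
import networkx as nx
import pandas as pd

# An invented mention edge list of the sort a preprocessing script might emit
edges = pd.DataFrame({
    "source": ["alice", "alice", "bob", "carol"],
    "target": ["bob", "carol", "carol", "alice"],
})

# pandas holds the tabular data; NetworkX handles the graph math
G = nx.from_pandas_edgelist(edges, "source", "target", create_using=nx.DiGraph)
centrality = nx.in_degree_centrality(G)  # who gets mentioned most?
```

Three lines from data frame to centrality scores, but only if you can read and write the language they live in.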

The field of communication is uniquely positioned to apply CSS in innovative ways.

Computer science and information science already have long head starts on CSS compared to the social sciences. Many of the best CSS tools were created by the students, graduates, and faculty of such departments, some of whom already study communication phenomena such as the flow of news memes online (Leskovec et al., 2009) and partisan polarization in social media (Conover et al., 2011). So one possible response to proposals to build CSS strength in comm departments is: well, CS and IS are the experts here—how could we do better than them? The answer is: in areas of relevance to communication theory and practice, we have a couple distinct advantages.

First, most computer and information scientists lack the theoretical background to explain the meaning and significance behind their findings. Their research orientation is informed primarily by the priorities of engineering, which include speed, accuracy, efficiency, and algorithmic elegance (Freelon, in press). As such, many are more concerned with chasing the cutting edge of software development than with explaining social phenomena. (I’m talking about general trends here—I don’t want to dismiss those CS and IS scholars who have reached across disciplinary lines to produce excellent social scientific research.) In contrast, we marshal our methods in service of communication theory and practice—CSS is no different in this than depth interviews, surveys, or ethnographies. In short, their comparative CSS advantage is in the development of new software and techniques, whereas ours lies in using those tools to analyze and explain communication phenomena.

Second, our capacity for methodological pluralism, particularly the combination of CSS and qualitative methods, is greater than in the engineering sciences. While pluralism is by no means unknown among them, as a group they strongly privilege algorithmic and automated methods. Communication researchers are comparatively more comfortable mixing methods and can more easily apply qualitative and CSS methods to complex research questions. A couple of my own forthcoming papers offer examples of how this can be done (Freelon & Karpf, in press; Freelon, Lynch, & Aday, in press). As a field we are uniquely positioned to cultivate a strong dialectic between macro (CSS) and micro (qualitative) empirical levels that raises the quality of our theoretical explanations.

PhD (and master’s) graduates will be strong candidates for both academic and non-academic positions.

Tenure-track faculty positions are in short supply across academia. Comm is actually doing better than some fields in this regard, but there still aren’t nearly enough TT jobs for all qualified candidates. Training comm master’s and PhD graduates in CSS can be one part of the solution. More than other methodological specializations, CSS training prepares students for jobs outside the academy. The end of this article includes several lists of essential skills for industry-focused data scientists, and many of these include some variant of “communication skills,” “storytelling,” “curiosity,” “visualization,” and/or “domain knowledge.” These non-technical capabilities are already part of most decent PhD programs—add key technical components and you’ve got most of the skills employers are looking for in a data scientist. Comm graduates would obviously be best suited to working in communication-related industries such as PR, journalism, advertising, and social media. Indeed, a handful of comm PhDs have already been hired by major social media and tech companies (e.g. David Huffaker, Lauren Scissors, and Loi Sessions Goulet), although not all are in CSS. We could make this a more common occurrence.

Comm also has an opportunity to make some of its unique insights relevant to industry. For example, to avoid the problematic assumption that digital traces such as Facebook likes and retweets have fixed meanings (authority, influence, endorsement, etc.), we can point out when such assumptions are more and less likely to hold (Freelon, 2014). Similarly, we can cast a critical eye on companies like Klout that claim to measure concepts such as “influence” using proprietary formulas of unknown validity. Closely scrutinizing such practices holds real business value: it’s important to know whether a given product actually measures what it claims to measure before buying or using it.

All right; now that I’ve sold you on the general prospect, let’s move to the four key areas for CSS.

Faculty

This one’s pretty obvious—several comm departments have recently hired in CSS (UPenn, UW-Seattle, and UMD-College Park, among others), and this will likely continue into the foreseeable future. Trouble is, you can’t just hire one prof and call yourself a CSS powerhouse. Critical mass is needed—probably at least three faculty and preferably more—to support multiple courses, advisees, and research projects. Eventually, you want students to look at your department and think “wow, look at all the CSS faculty they have; seems like a really supportive place for that kind of work.” Ideally your faculty would specialize in diverse areas of CSS such as machine learning, network analysis, visualization, predictive modeling, etc. But all should be ready, willing, and able to apply these skills to communication research questions. That doesn’t necessarily go without saying: most CSS PhDs don’t have a comm background, and many don’t care much about doing comm research. But giving those that do a supportive work environment will be critical in nurturing the next generation of comm CSS scholars.

Which brings me to my next point: not only must our enterprising department hire CSS faculty, it must also promote them. This means broadening its tenure guidelines to include more than just books and journal articles. For example, many of the top publication venues for CSS are not traditional journals but “archived proceedings” from conferences like CHI (Computer-Human Interaction), CSCW (Computer-Supported Cooperative Work), WWW, and ICWSM (International Conference on Weblogs and Social Media). These have acceptance rates on par with peer-reviewed journals and should be evaluated similarly. Our department might also consider the unique value of CSS-friendly open-access publications such as PLoS One and Big Data & Society, which help spread new discoveries across disciplinary lines. Other nonstandard scholarly contributions that should be made explicit in CSS-friendly promotion guidelines include creating interactive online visualizations, contributing to open-source research software projects, and curating large, high-quality datasets. Many of these contributions will yield greater scholarly impact (in terms of use and citations) than the typical journal article.

Exactly how much these contributions should count is up for debate. I certainly don’t think anyone should be able to earn tenure on visualizations alone, but if they provide scholarly value, they should count for something. This is part and parcel of signaling to CSS faculty that their work is valued—and we all know what happens to talented researchers who don’t get that message.

Curriculum

CSS faculty must be given the latitude to teach in their area(s) of methodological expertise. But our department needs more than just a single introductory-level CSS course. Quantitatively-oriented comm grad students often take three or more stats courses, and those who want to learn research-grade CSS should have similar options. One effective way to start would be to offer a multi-course CSS track similar to the statistics tracks many departments currently offer. Such a track could start with an introduction to Python or R and continue with courses in data manipulation, visualization, machine learning, and/or statistical modeling. Successful completion of the track could earn the student a master’s or PhD certificate in CSS.

It bears emphasizing that any comprehensive CSS curriculum needs to start by teaching students how to code. Our department will not be able to assume that students will enter knowing how to code, just as most currently don’t assume any particular level of statistical knowledge. This isn’t something that can simply be outsourced to the computer science department—communication students will use code for very specific purposes that computer scientists don’t always understand. In addition, learning how to apply computer programming to communication research questions from the start will help keep students motivated and stem the high attrition rates that plague traditional CS education.

Hardware

Like video and sound production, CSS is an infrastructure-intensive enterprise. Small-scale projects can be executed cheaply on repurposed in-house servers or low-capacity virtual cloud servers, but our lofty goals require a much more substantial capital investment. There are two general directions we could go here: the first is to commit to paying a company like Amazon a monthly fee for a dedicated chunk of virtual computing resources for data collection, analysis, and storage. The major advantage of this approach is convenience: the cloud provider handles all the administrative details so that all our faculty and students need do is log in and get to work. But going the cloud route is like paying for web hosting: you lock yourself into a long-term relationship with your provider, which means we need to be rich enough to pay it indefinitely. And deciding to switch providers or move to an in-house option down the road is a logistical nightmare proportional to the amount of time spent with our original provider.

The other option is to use university-hosted hardware. The biggest advantage here is cost—the initial capital investment on the machines is a one-time expenditure. This consideration alone may make it the only feasible option for less wealthy departments. There are a number of ways to self-host, each with its own set of issues. Some universities make high-performance computing clusters (HPCCs) available to the entire campus—depending on the exact setup, our department could outsource some or possibly all of its computing needs to it. Obviously this would be very attractive from a budget perspective, but other departments will almost certainly be using the cluster already, which will limit available processing capacity. There may also be other limits on who is allowed to use it, what kinds of software can be installed, how much data can be stored, and how the system is allowed to access the Internet, among others. We would need to have a long conversation (probably several) with the HPCC administrator to determine the extent to which it will suit our needs.

The other self-hosted option would be for the department to build its own small server cluster. This would maximize control and configurability but also require active management and monitoring. Ideally this could be done by someone on the department’s IT staff; it’s not the sort of thing faculty or students should spend their time on. But that probably means adding to an existing staff person’s workload, which may entail a pay raise. Alternatively, if there’s room in the budget, the department could hire a full-time staff person to handle things like cluster administration, purchasing, keeping the disk images up to date, troubleshooting, user management, basic training, etc.

(A quick note about software before I move on to data: most CSS software is FLOSS, and your faculty will know what’s best to use, so it’s not a major planning concern. But if there are specific packages that need to be purchased, those can be added to the data budget, which will almost certainly be much larger.)

Data

There are three basic ways of obtaining CSS data: you can collect it yourself, you can buy it, or you can make it. Collecting data in-house is cheaper but more time-consuming and error-prone, while buying it costs money but usually results in better quality. To take social media data as an example, many platforms restrict the amount of data that can be extracted from their public APIs as a quality-of-service measure. As a result it’s difficult to know just how representative self-collected samples are. Purchasing data from an authorized data vendor such as Gnip also buys you some degree of assurance that you’re actually getting all data relevant to your sample frame. For example, if you were to collect tweets from the #Ferguson hashtag using a harvesting server like 140dev, you’d have no idea whether your data were representative or how many tweets you were leaving behind. But purchasing the data allows you to obtain all of the relevant data for whatever time period you’re interested in (at least in theory).

There are also many non-social media types of data of interest to communication researchers that can be purchased. Companies like Nielsen, Comscore, and Alexa sell high-quality audience measurement data for the non-social web. Nielsen sells comparable data for TV (as they have for decades), books (Nielsen BookScan), and music (Nielsen SoundScan). Many TV news transcripts are available through a pay source most comm departments already have access to—LexisNexis. I’m sure there are many other sources I’m not aware of, but this brief list conveys a sense of what’s available to departments with research budgets.

Lastly, some CSS researchers generate their own data by measuring user interaction with bespoke sociotechnical systems. The tradition of computer-based experiments actually has a longer history in communication than many realize (e.g. Sundar & Nass, 2000). Probably the main logistical issue here is the provision of lab space for small-scale, in-house computational experiments. Such resources can also be used to pre-test measures and instruments for later use in online experiments where many factors lie outside the researchers’ control.

Concluding thoughts

As noted earlier, checking all these boxes can’t be done on the cheap. The total cost must be tallied not only in money but also in non-monetary transition costs and (potentially) resistance from skeptical colleagues. There are no guarantees when it comes to shifts of this magnitude—failure’s always a possibility, especially if all the necessary resources don’t come through for one of the four areas. Moreover, there’s other important work that needs to occur at the disciplinary and interdisciplinary levels, including the establishment of official sections in the major professional orgs, specific initiatives to increase CSS visibility in top research outlets, and discipline-spanning institutes that bring together practitioners from across campus. All that said, it seems extremely unlikely to me that the importance of analyzing digital communication data through programming will wane in the near future. If I’m correct, the first comm department to do CSS effectively will emerge as a nationwide model for the discipline and beyond. Sounds like a place I’d like to work.

[View the map. You’ll need an up-to-date browser with Javascript enabled.]

In an effort to better understand the theoretical landscape of my chosen academic field, I have created a co-citation network visualization based on bibliographies found in nine major communication journals. The nine journals chosen are some of the best-known and longest-running in the field:

Communication Research

Communication Theory

Critical Studies in Media Communication

Human Communication Research

Journal of Broadcasting & Electronic Media

Journal of Communication

Journal of Computer-Mediated Communication

Journalism & Mass Communication Quarterly

Political Communication

Co-citation is a well-established technique for mapping academic disciplines (among other applications). The basic idea is that two publications are considered linked or “co-cited” when they appear in the same article’s reference list. After a basic co-citation network is created, a community-detection algorithm can be run to generate an organic impression of a discipline’s major subtopics and authors. In this map, the co-citation communities identified by the algorithm are grouped together by color. I doubt the specific groupings will surprise any seasoned scholars, but they will certainly help beginners (like me) get a sense of what our colleagues in other divisions have been thinking about over the past decade.
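The core counting logic behind a co-citation network is simple enough to sketch in a few lines of Python. The citation keys below are invented for illustration (in the first-author style WoK uses); the principle is just counting pairs within each reference list:

```python
from collections import Counter
from itertools import combinations

def cocitation_counts(reference_lists):
    """Count how often each pair of cited works shares a reference list."""
    pairs = Counter()
    for refs in reference_lists:
        # sort so (a, b) and (b, a) count as the same undirected link
        for a, b in combinations(sorted(set(refs)), 2):
            pairs[(a, b)] += 1
    return pairs

# Invented citation keys for illustration
bibs = [
    ["IYENGAR S, 1991", "ENTMAN R, 1993", "MCCOMBS M, 1972"],
    ["IYENGAR S, 1991", "ENTMAN R, 1993"],
    ["ENTMAN R, 1993", "MCCOMBS M, 1972"],
]
```

The resulting pair counts become weighted edges, which is the input the community-detection step works on.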

Maps similar to this one have been created for sociology and philosophy, and I credit those authors for giving me the idea to create this one. In doing so I relied heavily on Neal Caren’s excellent Python script for scraping citation data from Web of Knowledge (WoK). In the next section I give a guided tour of the map, after which I provide additional methodological details.

The map

One of the first things you’ll notice about the map is that publications are listed by first author only. This is how WoK stores references, but in most cases it shouldn’t be too hard to figure out which article or book is intended. Also, a few very popular articles probably have at least one duplicate node–I did not attempt to clean this dataset because I couldn’t figure out a non-manual way to do so.

Only highly-cited items appear on this map, a decision made for the sake of both parsimony and technical limitations. In order to make the initial cut, a publication had to 1) have at least ten citations according to WoK and 2) be co-cited on at least five reference lists with another publication meeting the first criterion. In this way, a network of 80,880 unique cited publications* and 3,878,211 co-citation links drawn from 2,834 seed articles was whittled down to 1,124 pubs and 6,092 links.
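Here is a hedged sketch of how those two inclusion criteria could be applied in Python. This is not the code I actually used, and the toy counts are invented; it just shows the filtering logic:

```python
def prune(citation_counts, cocitations, min_cites=10, min_cocites=5):
    """Apply the two inclusion criteria: a publication survives only if it
    clears the citation bar AND retains at least one strong co-citation
    link to another publication that also clears it."""
    popular = {p for p, n in citation_counts.items() if n >= min_cites}
    kept_links = {pair: n for pair, n in cocitations.items()
                  if n >= min_cocites and pair[0] in popular and pair[1] in popular}
    kept_nodes = {p for pair in kept_links for p in pair}
    return kept_nodes, kept_links

# Invented toy counts: "C" misses the citation bar; the B-D link is too weak
citation_counts = {"A": 12, "B": 15, "C": 3, "D": 20}
cocitations = {("A", "B"): 6, ("A", "C"): 9, ("B", "D"): 4}
kept_nodes, kept_links = prune(citation_counts, cocitations)
```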

If you mouse over a given publication you’ll see the others to which it is connected. A link between two publications means that the two are co-cited at least five times. Thicker links mean more co-citations. Intra-community links share the community’s color; inter-community links take on one of the two communities’ colors at random. A publication’s node size reflects the number of bibliographies in which it appears.

The nine colored communities in this network represent the nine most densely-interlinked subtopics addressed in the journals. The community detection algorithm identified a total of 28 link clusters, so nine is an arbitrary number (I had to stop somewhere). These top nine represent about a third of the communities found, but this third contains 89.5% (1,006/1,124) of all the pubs that met the initial inclusion criteria.

Here I give each community a label and a short description, but I can’t claim expertise on all of them, so corrections and suggestions are welcome.

Interpersonal communication, offline and on. Unsurprisingly, this community was well-represented in JCMC. It incorporates pieces from both the digital age and long before, with Walther, Berger, and Knobloch being especially prominent. Classic works by Goffman, Spears, Altman & Taylor, and Parks & Floyd can also be seen.

Race and media. One of the smaller communities, this one builds on foundational work in both media studies and effects (e.g. Entman, Dixon, Gilliam, Valentino) and psychology (Fiske, Devine). Much of it focuses on pejorative perceptions of African Americans by whites.

Parasocial interaction/uses & gratifications. Drawing heavily on psychologists such as Bandura and Fishbein, this cluster examines how and why people consume media (especially popular media) as well as their relationships with the characters on the screen. (This is one of the ones I know less well, so let me know if there’s a better way to describe it.)

Selective exposure. From the foundational work of Festinger and Sears & Freedman in the 1950s and 60s to Sunstein’s Republic.com, this community focuses on how people select, reject, and justify media content and the consequences for their opinions, beliefs, and emotions.

Multimedia information processing/knowledge gap. This cluster is heavily anchored in the work of Lang, Grabe, Reeves, and Newhagen. Its objects of study are the influence of multimedia on cognition, specifically memory, emotion, and knowledge. The knowledge gap concept is also prominent here. (Again, I’m not an expert here, so please correct as appropriate!)

Civic engagement/political participation/deliberation/social capital. This cluster is concerned with the roles of media and communication in citizens’ engagement with politics and their communities. The second largest community by internal links, it incorporates leading research from sociology (Coleman, Wellman, Granovetter) and political science (Putnam, Norris, Huckfeldt) in addition to communication.

Psychology of communication/cultivation theory/statistical methods. This cluster shares a few links with the “visual images” and “parasocial interaction” clusters but is distinct from both. With Petty & Cacioppo’s classic book on the elaboration likelihood model as its primary anchor, this research investigates concepts such as information processing, emotion, persuasion, influence, and attitudes as they pertain to communication. Interestingly, major pieces on statistical analysis by Holbert & Stephenson, Bollen, and Baron & Kenny are also included here.

Third-person effect/hostile media effect. This community is home to the closely-related hostile media and third-person effects, both of which involve people’s beliefs about how media messages relate to others. Though its originator (Davison) was a scholar of journalism and sociology, later third-person effect research increasingly relies on concepts borrowed from psychology (e.g. Eveland, Nathanson, Detenber, & Mcleod, 1999; Henriksen & Flora, 1999; Hoffner et al., 1999).

Agenda-setting/framing/priming. In a development that will surprise no one, the largest cluster by far is devoted to the study of three interrelated media effects: framing, priming, and agenda-setting. The major works and authors here will be known to nearly all students of mass communication: Iyengar, Entman, McCombs, Zaller, Gamson, Shoemaker, Bennett, Price, Scheufele, and many more…

There is much to say about these clusters–much more than I have time to articulate–so I’ll limit myself to an observation and a related caveat. First, critical theory is conspicuous in its absence from these clusters. Marx, Foucault, Adorno, Williams, Baudrillard, Butler, and other critical stalwarts are nowhere to be found among this list of landmark works. Among those critical theorists who do make the cut are Chomsky, Habermas, Hall, and Bourdieu, though I leave to the reader the exercise of finding them on the map.

One reason for the omission may be the use of the journal as the sampling unit. Much critical work is published in books, and while many books appear on the map, it is clear that journal articles largely tend to cite other journal articles. And in communication, the better-known journals tend to publish work that is quantitative, empirical, epistemologically social-scientific, and American in focus. So the major caveat for this map is that it almost certainly underrepresents work that is qualitative, purely theoretical, critical, and non-American. Unfortunately, there is no easy way to integrate books into it, and even if there were, there is no preexisting list of the most-cited books in communication.

Additional method notes

From each journal, all reference lists from all research articles (specifically excluding book reviews and similar) available in WoK between 2003 and 2013 were extracted on September 3, 2013. A few items from 2002 were included for some journals.

For those who are interested, here is a quick summary of how I created this map:

Downloaded full reference lists from WoK for all articles (excluding book reviews etc.) published between 2003-Sept 2013 from the above journals in plain-text format

Extracted and counted co-citations from those reference lists using a modified version of Neal Caren’s Python script

Removed all publications with fewer than ten citations and all co-citation links occurring on fewer than five reference lists

Ran a community-detection algorithm to identify the major link clusters and colored the top nine

Displayed the resulting network in the browser using GEXF.js

The raw data for the map (1,124 nodes/6,092 edges) can be downloaded here.

If you have any questions about how I made the map, I’d be happy to answer them. Also, if you have suggestions for additional journals to add, let me know and I may be able to do it–but GEXF.js is limited in the amount of network data it can display so there’s no guarantee.

*The true number is somewhat less than this, as some pubs are listed under different names due to incompatible citation practices and miscellaneous citation errors.

I just completed a new version of the Python edition of T2G, which adds a few new features. Most prominent among these is the ability to extract only retweets or only mentions for visualization in Gephi. Recent research has shown substantive differences between networks based on these behaviors, so it is important for researchers to be able to distinguish between them. The new version also fixes a bug that halted processing whenever two @s appeared adjacent to one another in a tweet (i.e. “@@”).

A few weeks ago I posted a spreadsheet that converted tweet mention data into Gephi format for social network analysis. A key limitation of that spreadsheet is that it only converts the first name mentioned in each tweet, discarding the rest. For example, for the following tweet:

that spreadsheet would pull @alexhanna into the Gephi file as one of my mentions but not @cfwells or @kbculver.

To remedy this issue, I’ve created T2G, a solution that converts all Twitter mention data fed into it to Gephi format. T2G comes in two flavors, Python and PHP, each of which does the same thing. The PHP edition is more user-friendly, while the Python edition is faster and easier to set up. All you need to do is supply a CSV file containing two columns: the first (leftmost) filled with the tweet authors’ usernames, and the second filled with their corresponding tweets. You’ll find additional instructions in an extended comment at the top of each script. Please ensure that you have the appropriate interpreter installed (PHP or Python) before trying to use either of these scripts.
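For the curious, here is a stripped-down Python sketch of the conversion T2G performs. This is an illustration of the idea rather than the T2G code itself, and it handles only the basic two-column case:

```python
import csv
import re

def tweets_to_edges(infile, outfile):
    """Read a two-column CSV (author username, tweet text) and write a
    Gephi-style edge list with one row per @mention -- all mentions,
    not just the first. A sketch of the same idea as T2G, not the script."""
    with open(infile, newline="", encoding="utf-8") as f_in, \
         open(outfile, "w", newline="", encoding="utf-8") as f_out:
        writer = csv.writer(f_out)
        writer.writerow(["Source", "Target"])  # headers Gephi recognizes
        for author, tweet in csv.reader(f_in):
            # every @username in the tweet becomes its own edge
            for mention in re.findall(r"@(\w+)", tweet):
                writer.writerow([author, mention])
```

The key difference from the old spreadsheet is the inner loop: it emits an edge for every mention in the tweet instead of stopping at the first.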

Both of these scripts produce equivalent output, albeit in a slightly different order (you can rank the data in alphabetical order to check if you like).

You can test a “lite” version of the PHP edition below–it will convert only the first 100 tweets in your file. Feel free to test it using this sample file, which contains some of my own recent tweets formatted to the above specifications.

A couple recent articles have gotten me thinking about methods for geolocating Twitter users: Kalev Leetaru et al.’s recent double-sized piece in First Monday explaining how to identify the locations of users in the absence of GPS data; and the Floating Sheep collective’s new Twitter “hate map,” which has received a fair amount of media attention. The ability to know where social media users are located is pretty valuable: among other things, it promises to help us understand the role of geography in predicting or explaining different outcomes of interest. But we need to adjust our enthusiasm about these methods to fit their limitations. They have great potential, but (like most research methods) they’re not as complete as we might like them to be.

Let’s start with the gold standard: latitude/longitude coordinates. When Twitter users grant the service access to their GPS devices and/or cellular location info, their current latitude and longitude coordinates are beamed out as metadata attached to every tweet. Because these data are generated automatically via very reliable hardware and software, we can be reasonably certain of their accuracy. But according to Leetaru et al., only about 1.6 percent of users have this functionality turned on. Due to privacy concerns, Twitter offers it on an opt-in basis, which partly explains the low level of uptake.
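Extracting these coordinates from the JSON tweet objects Twitter’s API returns is straightforward, with one wrinkle worth a comment: the “coordinates” field holds GeoJSON, which lists longitude before latitude. A minimal sketch, with invented example tweets:

```python
import json

def get_latlong(tweet_json):
    """Return (lat, long) if the tweet is geotagged, else None. Twitter's
    "coordinates" field holds GeoJSON, which puts longitude FIRST --
    an easy mistake to make when mapping."""
    tweet = json.loads(tweet_json)
    coords = tweet.get("coordinates")
    if coords is None:
        return None
    lon, lat = coords["coordinates"]
    return lat, lon

# Invented example tweets for illustration
geotagged = '{"text": "hi", "coordinates": {"type": "Point", "coordinates": [-77.03, 38.9]}}'
untagged = '{"text": "hi", "coordinates": null}'
```

Note how often the second case fires in practice: for the vast majority of tweets, this function simply returns None.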

For the social science researcher, relying on lat/long for geolocation in Twitter raises a major sampling bias issue: what if users who have this feature turned on differ systematically in certain ways from those who don’t? Here are a few plausible albeit untested (as far as I know) characteristics that geolocatable social media users may be more likely to exhibit:

I’m sure you could come up with more. The point is, we cannot assume that, for example, a map containing geotagged racist/homophobic/ableist tweets faithfully represents the broader Twitter hate community. If all hate-tweets came geotagged, the map might look very different, especially since, as things stand now, many haters are savvy and motivated enough to keep their hate less visible.

Fortunately, lat/long coordinates are not the only option for trying to figure out where tweeps are. Leetaru et al. helpfully offer a series of methods for doing so in cases where this information is absent (other stabs at this include Cheng et al., 2010 and Hecht et al., 2011). Leetaru et al.’s most effective methods focus on the freetext “Bio” and “Profile” fields, which, when combined, increase the number of correctly IDed locations at the city level to 34%. This represents an increase of more than an order of magnitude over what lat/long alone allow, as well as a very cool research finding in its own right. However, the sampling bias problem applies with nearly equal force to this enhanced data: the strong possibility still exists that the nearly two-thirds of unlocatable tweeps differ in critical ways from those whose locations can be identified.

So, what to do? Ideally, to be able to generalize effectively, we want to be able to say that the individual-level characteristics and overall geographic distribution of our geoidentified users resemble those of a representative sample of all users within our sampling frame. But this is a very tall order methodologically, and even if we could accomplish it, the results would likely disappoint us.

Our options at this point depend upon the required level of location granularity: the coarser it is, the better we’ll be able to do. If, for example, we only need country-level data for a fairly small N of countries, we can take advantage of the fact that it is easier to identify a user’s country than her city. One strategy here would be to start with string-matching methods like those used by Leetaru et al. and Hecht et al., which attempt to identify locations listed in various Twitter fields using dictionaries of place names. Next, for users whose locations can’t be identified this way, a less definitive machine-learning method could be substituted to guess locations based on tweet text. This second method has the notable disadvantage of forcing a location guess for each user, introducing the possibility of misidentification, whereas the first simply leaves unlabeled all users that don’t yield conclusive dictionary matches. (It is also more computationally intensive due to the higher volume of data required and the complexity of most machine-learning algorithms.) Nevertheless, Hecht et al. achieve between 73% and 89% accuracy with this method at the country level (depending on how the data are sampled), suggesting that it could help researchers address the sampling bias issue in some scenarios. It would probably suffice to identify relatively small randomly-selected subsamples for each country of interest using machine learning, compare them to those IDed via string-matching, and search for major differences between each pair of groups.
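The string-matching stage can be illustrated with a toy gazetteer. To be clear, the place-name dictionary below is purely hypothetical; real studies use far larger lists and must deal with ambiguous names (a “London” entry will also match Canadian Londoners, for instance).

```python
# Toy place-name dictionary mapping lowercase terms to country codes.
# Purely illustrative -- real gazetteers contain thousands of entries.
COUNTRY_TERMS = {
    'united states': 'US', 'usa': 'US', 'new york': 'US',
    'united kingdom': 'GB', 'london': 'GB',
    'canada': 'CA', 'toronto': 'CA',
}

def match_country(location_field):
    """Return a country code if any dictionary term appears in the
    user-supplied location string; otherwise None (leave unlabeled)."""
    loc = (location_field or '').lower()
    for term, country in COUNTRY_TERMS.items():
        if term in loc:
            return country
    return None
```

Crucially, users with no conclusive match stay unlabeled rather than being forced into a guess, which is exactly the trade-off against the machine-learning fallback described above.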

The prospects for determining the representativeness of geolocated users at more specific locations than their countries are much slimmer. The accuracy of Hecht et al.’s machine-learning geolocation technique drops from 73-89% at the country level to 27-30% at the US state level. Extending this logic, it’s probably safe to assume an inverse relationship between algorithm accuracy and location specificity when 1) location info is sparse in the data (as with tweets) and 2) the set of possible locations is very large or unbounded. Under these conditions I can’t think of how one might go about measuring how representative the known locations are of the unknowns. (If you have any ideas, leave a comment!) At that point you might simply have to grant that you can’t say much about how representative your sample is, and justify your study’s contributions on other grounds.

EDIT 05/15/13: I’ve posted two scripts, one in PHP and one in Python, that overcome the main limitation of this spreadsheet: they pull in all mentioned names rather than just the first one. Download one or both here.

If you’ve ever wanted to visualize Twitter networks but weren’t sure how to get the tweets into the right format, this spreadsheet I’ve been using in my classes might be worth a try. It prepares Twitter data for importing into Gephi, an open-source network visualization platform. It requires a little cutting and pasting, but once you get the hang of it you’ll be visualizing social network data in no time. Here’s the link:

As the nation waits to find out who our next president will be, I thought it would be interesting to take a quick look at how Obama’s and Romney’s Facebook followers reacted to content posted to the candidates’ official Facebook walls. As part of a larger research project, I’m extracting all public comments posted to both walls between April 25 (the day the RNC endorsed Romney) and November 2, 2012. While doing so, I noticed some clear patterns in the kinds of content each group of followers showed most interest in. By charting the numbers of likes, shares, and comments for each message during the aforementioned time period, we can get a sense of when attention spiked and how much. Examining the top five most liked, shared, and commented-on posts reveals what topics attracted the most Facebook attention during the final leg of the campaign. Let’s start with Obama:
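Leaderboards like the ones that follow are easy to compute once the posts and their engagement counts are in hand. A minimal sketch, with hypothetical field names standing in for however your extraction pipeline stores each post:

```python
def top_posts(posts, metric, n=5):
    """Rank posts by a single engagement metric and return the top n.
    Each post is a dict with a 'message' plus numeric counts under
    keys like 'likes', 'shares', and 'comments' (hypothetical names)."""
    return sorted(posts, key=lambda p: p[metric], reverse=True)[:n]
```

Running it once per metric (likes, shares, comments) yields the three top-five lists compared below.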

As you’ve probably already noticed, I’ve included the text of the top five most-liked posts in the dataset, along with the images associated with the first four, in chronological order. (I didn’t have room for the fifth image, but its text speaks for itself best among the five.) The first thing that jumped out at me here was that none of the top five most-liked posts had anything to do with politics: they were scenes from the Obamas’ family life, the kinds of moments that could be found in any American family photo album. The wholesome sentiments these shots convey couldn’t be further from the knock-down drag-out negativity flooding the airwaves and the Internet throughout the timeframe, which may explain why they were so popular among Obama fans.

All of Romney’s top five most-liked posts were direct calls to push the campaign’s “like” count past some numerical threshold. Romney’s fans seem to be more goal-oriented than Obama’s: rather than reveling in idyllic family scenes, they were most interested in showing off their support for Romney to their Facebook friends. One broader interpretation here is that Romney’s Facebook fans were more engaged in the campaign than Obama’s, who seemed less inclined to get political. This is also reflected in the fact that although Obama had much higher median numbers of likes (111,231 vs. 64,182), shares (11,753 vs. 3,644), and comments (7,309 vs. 4,376) than Romney during this period, Romney had much higher “like” peaks. (Romney posted over twice as many messages as Obama, so his “like” totals are higher: 58.5M to 42.7M.)

Likes vs. shares vs. comments

How did the most-liked messages stack up against the most-shared and most-commented-on messages? Let’s have a look, starting with Obama:

“Michelle’s biggest fans watching her convention speech from home last night.” (554,713)

“Share this with your friends and family if you support this plan to keep us moving forward.” (63,551)

“Share if you agree: President Obama won the final debate because his leadership has made America stronger, safer, and more secure than we were four years ago.” (39,469)

My quick read on this table is that Obama supporters use shares and comments much more politically than they use likes. The top five most-shared messages are all about general support for Obama as opposed to specific policy issues. Comments are a mixed bag, with the top spot going to a call for birthday wishes and the fourth spot containing the sole policy statement in the entire table. The remaining most-commented posts are similar in nature to the most-shared.

Romney’s most-shared and -commented posts are both similar to and different from Obama’s:

Rank 1
Most liked (1,167,589 likes): “We’re almost to 5 million likes — help us get there! ‘Like’ and share this with your friends and family to show you stand with Mitt!”
Most shared (93,329 shares): “We don’t belong to government, the government belongs to us.”
Most commented (105,839 comments): “We’re almost to 5 million likes — help us get there! ‘Like’ and share this with your friends and family to show you stand with Mitt!”

Rank 2
Most liked (1,112,300 likes): “We’re almost there – Help us get to 10 million Likes!”
Most shared (70,373 shares): “It’s the people of America that make it the unique nation that it is. ‘Like’ if you agree that entrepreneurs, not government, create successful businesses.”
Most commented (66,622 comments): “The American people know we’re on the wrong track, but how will President Obama get us on the right track? http://mi.tt/S8WQWZ”

Rank 3
Most liked (986,653 likes): “Stand with Mitt. ‘Like’ and share to help us get to 6 million Likes!”
Most shared (62,905 shares): “We’re almost to 5 million likes — help us get there! ‘Like’ and share this with your friends and family to show you stand with Mitt!”
Most commented (62,691 comments): “Like and share to help us get to 8 million likes!”

Rank 4
Most liked (719,837 likes): “Help us get to 7 million likes! ‘Like’ and share to show you’re with Mitt.”
Most shared (46,524 shares): “Like and share to help us get to 8 million likes!”
Most commented (52,059 comments): “The path we’re taking is not working. It is time for a new path. Donate today and help us get America back on track http://mi.tt/QZkDpL”

Rank 5
Most liked (614,492 likes): “Like and share to help us get to 8 million likes!”
Most shared (39,652 shares): “I intend to lead and to have an America that’s strong and helps lead the world. ‘Like’ and share if you will stand with me.”
Most commented (47,732 comments): “We don’t have to settle. America needs a new path to a real recovery. Contribute $15 and help us deliver it. http://mi.tt/Tacap0”

Romney’s most-shared messages are similar to Obama’s in their lack of specificity. Unlike with Obama, though, there is some overlap between the three modes of interaction: at least one “help us get to X million likes” post shows up on each list. The most interesting thing about the most-commented posts is that three of the five are pretty clear attacks on Obama, while I see a couple of Obama’s most-commented as indirect attacks at best. The idea that “everybody deserves a fair shot, not just some” could be a shot at Romney’s supposed elitism, but the claim that Obama’s “leadership has made America stronger, safer, and more secure than we were four years ago” is more about Obama than about Romney.

Closing thoughts

I think these data show some definite patterns in the types of engagement the Romney and Obama Facebook pages elicited. One important point about these data I want to stress is that they say much more about each campaign’s supporters than they do about the candidates. For example, Obama asked his supporters to like and share content, and Romney talked about his family, but those posts didn’t resonate as much with their followers. I also find the contrast with Twitter quite instructive–many studies, including my own research, have found Twitter activity to be highly event-driven, spiking when big stories break. Activity on the candidates’ walls looks to be much less so–few of the top five messages reference time-specific events, and few were posted on milestone days for either campaign. So it looks to me like the campaigns have a much greater capacity to drive attention with particular types of content on Facebook than on Twitter, which functions as more of a real-time information distribution network.

If anything in particular jumps out at you in this data or you disagree with any of my interpretations, I’d love to hear about it in comments.

Update 2/21/2012: As my colleague Alex Hanna recently informed me, up to 2% of the archives below may consist of duplicate tweet IDs. If you intend to work with this data, I highly recommend removing all the duplicates first.
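The deduplication itself is a one-pass job. A minimal sketch, assuming the archives have already been read in as (tweet ID, user ID) pairs in the two-column format described below:

```python
def dedupe_by_tweet_id(rows):
    """Keep only the first occurrence of each tweet ID.
    `rows` is an iterable of (tweet_id, user_id) pairs; order of the
    surviving rows is preserved."""
    seen = set()
    out = []
    for tweet_id, user_id in rows:
        if tweet_id not in seen:
            seen.add(tweet_id)
            out.append((tweet_id, user_id))
    return out
```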

Since the release of the PITPI report in September 2011, several scholars have expressed interest in obtaining the Twitter data for their own research. I refused these requests categorically on the grounds that they would violate Twitter’s terms of service. However, one of these scholars asked Twitter directly about what data is allowed to be shared and what is not. He discovered that distributing numerical Twitter object IDs is not a violation of the TOS. I quote in full the message Twitter sent him below:

Hello,

Under our API Terms of Service (https://dev.twitter.com/terms/api-terms), you may not resyndicate or share Twitter content, including datasets of Tweet text and follow relationships. You may, however, share datasets of Twitter object IDs, like a Tweet ID or a user ID. These can be turned back into Twitter content using the statuses/show and users/lookup API methods, respectively. You may also share derivative data, such as the number of Tweets with a positive sentiment.

Thanks, Twitter API Policy

Consistent with this official message, I have bundled up the numerical IDs for all of the users and individual tweets contained in all of the archives I have. They can be downloaded from the links below. The files are in CSV format, and each row contains both a tweet ID (left column) and the ID of the user who posted it (right column). As the message above notes, these IDs can be used to retrieve the complete datasets directly from Twitter via its API. All of these archives are based on Twitter keyword searches (except for those containing hash marks, which are hashtag archives) and were amassed between January and March 2011, except for “bahrain” which spans nearly a year beginning at the end of March 2010. None of these archives should be considered exhaustive for the months they cover, as TwapperKeeper was limited in its tweet collection capacity both by its own hardware and by Twitter’s API query restrictions. With those caveats out of the way, here are the data:
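For those planning to rehydrate these archives, the mechanics amount to reading in the ID pairs and batching the IDs for Twitter’s lookup endpoints. A sketch of the bookkeeping side (the batch size of 100 assumes a bulk endpoint like users/lookup, which accepted up to 100 IDs per request at the time of writing; the actual HTTP calls and authentication are left out):

```python
import csv

def load_id_pairs(path):
    """Read a two-column CSV of (tweet_id, user_id) rows, as in the
    archives distributed below."""
    with open(path, newline='') as f:
        return [(row[0], row[1]) for row in csv.reader(f) if len(row) >= 2]

def batches(ids, size=100):
    """Yield comma-joined ID strings sized for a bulk lookup endpoint."""
    for i in range(0, len(ids), size):
        yield ','.join(ids[i:i + size])
```

Each yielded string would then be passed as the ID-list parameter of one API request, with the script sleeping between batches to stay under the rate limits.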

And even with this knowledge, recreating the full data sets would still take months of 24/7 automated querying given Twitter’s API limits.

Like many Twitter researchers, I reacted with dismay when Twitter changed its TOS last year to sharply restrict data sharing. To this day I struggle to think of a valid reason for the change, especially since their APIs remain open to anyone with the skills to query them. Nevertheless I feel bound to respect Twitter’s TOS, not so much out of fear of the consequences (although recent machinations at the DOJ may soon change that), but because so much social research depends critically upon the assumption that researchers will act according to the wishes of their subjects. The principle is similar to source confidentiality in journalism: if it became common practice for reporters to publish “off-the-record” information, sources would stop talking to them after a while.

I have no idea what the prospects for getting Twitter to revert their TOS are since I don’t know why they changed it in the first place. However, if you would like to see this happen, you might consider leaving a comment to that effect on this blog post detailing why you think it’s important. If nothing else, such comments might convey a sense of some of the different ways Twitter’s API policy is hampering research, and may also start a conversation about possible workarounds or other ways of resolving the situation.

Just a brief announcement—shortly before New Year’s I earned my Ph.D. I have also updated this site’s front page to indicate that I am now an assistant professor at the School of Communication at American University in Washington, DC. Thanks for all your support.