Scaling Social Science with Apache Hadoop

This post was contributed by researcher Scott Golder, who studies social networks at Cornell University. Scott was previously a research scientist at HP Labs and the MIT Media Laboratory.

The methods of social science are dear in time and money and getting dearer every day. — George C. Homans, Social Behavior: Its Elementary Forms, 1974.

When Homans — one of my favorite 20th century social scientists — wrote the above, one of the reasons the data needed to do social science was expensive was because collecting it didn’t scale very well. If conducting an interview or lab experiment takes an hour, two interviews or experiments take two hours. The amount of data you can collect this way grows linearly with the number of graduate students you can send into the field (or with the number of hours you can make them work!). But as our collective body of knowledge has accumulated, and the “low-hanging fruit” questions have been answered, the complexity of our questions is growing faster than our practical capacity to answer them.

Things are about to change. We’re reaching the end of what philosopher Thomas Kuhn might call “normal science” in the social sciences — a period of time when scholarly progress grows incrementally using widely-accepted methods. This doesn’t mean an end to interviews, surveys, or lab experiments as important social science methods. Though questions about interpersonal behavior and small groups are undoubtedly still interesting, what we really want to know — what we’ve always wanted to know — is how entire societies work. The most interesting findings are going to have to come some other way.

“Computational social science” [1] represents a turn toward the use of large archives of naturalistically-created behavioral data. These data come from a variety of places, including popular social web services like Facebook and Twitter, consumer services like Amazon, weblog and email archives, mobile telephone networks, or even custom-built sensor networks. What these data have in common is that they grow as byproducts of people’s everyday lives. People email, shop and talk for their own reasons, without thinking about how the digital traces of their activity provide naturalistic data for social scientists.

That the data are created naturalistically is important both methodologically and theoretically. Though social scientists care what people think, it’s also important to observe what people do, especially if what they think they do turns out to be different from what they actually do. When responding to surveys or interviews, subjects might honestly mis-remember and mis-report the past. They might deliberately omit some things that embarrass them, or rationalize post-hoc and justify actions differently from how they reasoned about them at the time [2]. Collecting data on actual behavior is seen by many as the gold standard of social science, and experimental methods have had, and continue to have, many successes across the social sciences, including how people interpret probabilities in decision making, and how people develop beliefs about status hierarchies along racial, gender and other dimensions. But it was recognized long ago that findings within a lab might not generalize to the whole world. What we need to do now is measure the whole world in a controlled way. The web services named above do just that. Want to know how corporations really work? Look at their email [3]. Want to know about racial preferences in dating? Look at online dating profiles (or even server logs) [4].

I believe that Hadoop is going to play a large role in analyzing these data and therefore in generating social science advances very soon. In the infancy of the social web, even successful systems had only thousands or tens of thousands of users (in contrast with tens of millions today), and creating an archive of all of the system’s data was as simple as doing SELECT * on each table in a MySQL database. But in an ironic twist, this method’s undoing would be the success of the social web itself. Though in Homans’ day, the questions grew faster than the data, today the data is growing faster than we can store and process it.
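To make the old approach concrete, here is a minimal sketch of that kind of whole-database archiving. It uses Python’s built-in sqlite3 as a stand-in for MySQL, and the table name and schema are made up for illustration — the point is only that dumping every row of every table is trivial at small scale.

```python
# Sketch of early-social-web archiving: "SELECT *" on each table.
# sqlite3 stands in for MySQL; the "users" table is hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])

def dump_all(connection, tables):
    """Archive every row of every table -- feasible only at small scale."""
    return {t: connection.execute(f"SELECT * FROM {t}").fetchall() for t in tables}

archive = dump_all(conn, ["users"])
print(len(archive["users"]))  # → 2
```

With millions of users and billions of rows, this single-machine dump-and-scan loop is exactly what stops working, which is where distributed processing comes in.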

Enter Hadoop. Last year, I decided to invest some time in learning to write my data analysis programs using MapReduce. Cornell is lucky enough to have a project called WebLab whose resources include a 50-plus node Hadoop cluster, and I am lucky enough to be allowed to use it. As soon as I ran some test cases on it — single-process implementations of computations that took 4.5 hours on my beefy workstation took 3 minutes when implemented in MapReduce — I was sold.
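For readers who haven’t seen the pattern, here is a small, dependency-free sketch of what MapReduce computations look like, simulated in plain Python. The function names (`map_phase`, `reduce_phase`) are illustrative, not part of any Hadoop API; on a real cluster, the framework handles the sorting, grouping, and distribution across nodes.

```python
# A minimal local simulation of the MapReduce pattern.
from itertools import groupby
from operator import itemgetter

def map_phase(records, mapper):
    """Apply the mapper to every record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)

def reduce_phase(pairs, reducer):
    """Group pairs by key (the 'shuffle' step), then reduce each group."""
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield key, reducer(key, [v for _, v in group])

# The canonical demo: counting words across lines of text.
lines = ["the quick brown fox", "the lazy dog"]
mapped = map_phase(lines, lambda line: ((w, 1) for w in line.split()))
counts = dict(reduce_phase(mapped, lambda key, values: sum(values)))
print(counts["the"])  # → 2
```

The cluster version of this has the same two functions; the speedup comes from Hadoop running many mappers and reducers in parallel over partitions of the input.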

In social network analysis, the research area I work in, the main questions of interest concern how patterns of social relationships affect individuals’ behavior and create social structure at the macro, or societal, level. Network analysts work in many areas of sociological interest, such as markets, employment, individual well-being, opinion formation, and others. Often, the structural properties of these networks are important predictors of individual behavior, but the computations required to calculate these measures are prohibitive. Right now, for example, I’m struggling to work with a comparatively large dataset comprising about 8 million people.

I learned quickly that it’s not the size of the data that kills you, it’s the size of the metadata. Though it’s relatively easy to count the number of friends or neighbors everyone has, other calculations, such as the average number of “steps” between each pair of people, have much more demanding computational requirements. Algorithms that are O(n²) or bigger in their space or time requirements become prohibitive — a network with 8 million members has a whopping 64 trillion relations between (all pairs of) members. No individual workstation, no matter how fancy, is equal to such a task. With enough disk space and RAM, and some fancy programming tricks that repeatedly swap to/from disk only the data necessary for parts of computations, you might be able to process all that data, once, and it might take several weeks to do even that. Distributing the same computation over a large number of Hadoop nodes and finishing the process in minutes or hours means that it’s possible to iterate rapidly. Iterating rapidly means fixing bugs rapidly, and trying variations rapidly. I can process weighted and non-weighted versions of the same graph in quick succession, with only a small code change, for example.
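The “small code change” between weighted and unweighted versions of a graph computation can be sketched concretely. Below is a hypothetical local version of the easy calculation mentioned above — counting each person’s neighbors — written in the map-then-sum-per-key shape a MapReduce job would take; the names and sample edges are made up for illustration.

```python
# Degree counting over an edge list, in map/reduce shape: emit a (node, value)
# pair for each endpoint of each edge, then sum values per node.
from collections import defaultdict

# Hypothetical sample data: (person, person, tie strength) triples.
edges = [("alice", "bob", 3.0), ("alice", "carol", 1.0), ("bob", "carol", 2.0)]

def degree(edge_list, weighted=False):
    totals = defaultdict(float)
    for u, v, w in edge_list:
        value = w if weighted else 1.0  # the only change between the two versions
        totals[u] += value              # "reduce" step: sum emitted values per key
        totals[v] += value
    return dict(totals)

print(degree(edges)["alice"])                 # → 2.0 (number of neighbors)
print(degree(edges, weighted=True)["alice"])  # → 4.0 (sum of tie strengths)
```

The harder O(n²) measures, like average path length, don’t reduce to a single pass like this, which is why they are the ones that demand a cluster.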

Learning to process data using MapReduce is a skill that scales. The benefits of MapReduce over conventional programming are, in my opinion, equal to (or greater than) the benefits of conventional programming over analyzing data by hand. It takes a sizable initial time investment to learn to think in this way, especially if you haven’t been exposed to functional programming before (most non-computer scientists haven’t). But after getting the basic idea — the application of a series of transformations and compressions to data — the usefulness of the skill continues to grow naturally. The data is going to keep getting larger and more detailed, as more people experience more of their social and economic lives online. But Hadoop clusters are going to get larger as well, and often increasing the number of nodes a job is processed on is as simple as changing one parameter in a configuration file. Academics can request access to the National Science Foundation’s TeraGrid system, and academics as well as recreational or corporate data crunchers can use MapReduce with their own clusters or cloud-based services like Amazon’s Elastic MapReduce.

Another of my favorite scholars, British sociologist Anthony Giddens, once remarked that because we live in a modern world in which people naturally reflect on their own behavior and the behavior of others, the professional sociologist is “at most one step ahead of the enlightened lay practitioner” [5]. And that was before we lived in a world of gigantic datasets. Besides their use in web search at Yahoo and Google, Hadoop and MapReduce have been touted as having tremendous potential in the area of business intelligence. The Economist just recently focused on this very issue [6]. Though corporations are generally not releasing their internally-generated data, governments and the media are starting to get into the act, with data.gov in the U.S. and the Guardian Datastore in the U.K. User-contributed sites like Swivel and ManyEyes contain data sets of many different kinds, though of relatively small sizes. Clearly, many people are interested in questions of social scientific importance and the stories these data can tell us. I think that’s a really good thing, and I’m excited for the long-term prospects of both “professional” and “amateur” data analysis. In the same way that the DIY movement and publications like Make Magazine have inspired laypeople to become interested in some of the principles and practices of engineering, public datasets can perhaps inspire interest in the social sciences. Right now, the datasets are small and not particularly interoperable, but I have some confidence that will change over time. Imagine a world in which mashups aren’t just songs and videos, but terabytes of data, where the data input path specified in a Hadoop configuration file isn’t a local directory containing one’s own data, but rather a URI pointing to some stranger’s (or company’s or government’s) publicly-available archives.

There’s a long way to go. Business practices, technologies and tools, and social science training each have years of advances to make before such a reality can become possible.

Until now, scientific computing has largely been the domain of the natural sciences, fields like fluid dynamics, astrophysics and bioinformatics. The computational social science revolution that is just beginning is mostly attributed to the growth in data available from the sources I’ve mentioned, and surely from scores more, and I agree; you can’t have data analysis without the data. But another important part of that story is cheap, pervasive, distributed computation in the cloud, and tools like Hadoop to process and analyze it all.

Great post… I agree that normal social science is about to be disrupted by the potential for detailed analysis of human behavior at a scale never before possible. But the second profound change is likely to be a shift to new questions that are asked about human motivation, trust, empathy, privacy, and responsibility that now become visible. The real shift will be from study of the natural world to the made world, where design changes can dramatically influence future behavior… I call this: Science 2.0
(Science 319 (March 7, 2008), 1349-1350.) http://www.sciencemag.org/cgi/content/full/319/5868/1349

1. Hadoop scales, but definitely not linearly (and scalability depends on the particular algorithm and task).
2. Communication overhead in a large cluster, combined with both the sub-par network I/O hardware of a “regular” server and Hadoop’s network/disk I/O implementation (which is far from optimal), means you will never come close to linear scalability in the majority of applications.

To the question about scaling: note that the cluster’s machines are multi-core, and each by itself is more powerful and has more memory than my own machine. More to the point, they are different algorithms with different data structures, so of course the runtimes will be different. But thank you for raising the question.

I think you’re absolutely right about the future importance of studying the made world. One thing I think social scientists have been wrong about in the past generation is not focusing enough on our built environment as a causal factor in behavior change. Midcentury, people like William H. Whyte and Stanley Milgram thought about the cognitive and behavioral relationships with space use, and I fear we’ve lost that somewhat. It’s definitely important for the current generation of CMC in which social spaces are so carefully and heavily architected.

Also, as you say in the Science piece, we do have to develop methods that move beyond description and case study, toward inference and hypothesis testing.

Thank you Scott for your inspiring article. I’m close to finishing a CS bachelor’s degree and would like to apply myself in social science. So my university choice wasn’t as far-fetched or illogical as it might have appeared at the beginning.