Twitter releasing trove of user data to scientists for research

Data uses include measuring happiness levels in cities based on images shared on Twitter.

Twitter has a 200-million-strong and ever-growing user base that broadcasts 500 million updates daily. It has been lauded for its ability to unsettle repressive political regimes, bring much-needed accountability to corporations that mistreat their customers, and combat other societal ills (whether or not such characterizations are, in fact, accurate). Now, the company has taken aim at disrupting another important sphere of human society: the scientific research community.

Back in February, the site announced its plan—in collaboration with Gnip—to provide a handful of research institutions with free access to its data sets from 2006 to the present. It's a pilot program called “Twitter Data Grants,” with the hashtag #DataGrants. At the time, Twitter’s engineering blog explained the plan to enlist grant applications to access its treasure trove of user data:

Twitter has an expansive set of data from which we can glean insights and learn about a variety of topics, from health-related information such as when and where the flu may hit to global events like ringing in the new year. To date, it has been challenging for researchers outside the company who are tackling big questions to collaborate with us to access our public, historical data. Our Data Grants program aims to change that by connecting research institutions and academics with the data they need.

In April, Twitter announced that, after reviewing the more than 1,300 proposals submitted from more than 60 different countries, it had selected six institutions to provide with data access. Projects approved included a study of foodborne gastrointestinal illnesses, a study measuring happiness levels in cities based on images shared on Twitter, and a study using geosocial intelligence to model urban flooding in Jakarta, Indonesia. There's even a project exploring the relationship between tweets and sports team performance.

Twitter did not directly respond to our questions on Tuesday afternoon regarding the specific amount and types of data the company is providing to the six institutions. But in its privacy policy, Twitter explains that most user information is intended to be broadcast widely. As a result, the company likely believes that sharing such information with scientific researchers is well within its rights, as its services "are primarily designed to help you share information with the world," Twitter says. "Most of the information you provide us is information you are asking us to make public."

While mining such data sets will undoubtedly aid scientists in conducting experiments for which similar data was previously either unavailable or quite limited, these applications raise some legal and ethical questions. For example, Scientific American has asked whether Twitter will be able to retain any legal rights to scientific findings and whether mining tweets (many of which are not publicly accessible) for scientific research when Twitter users have not agreed to such uses is ethically sound.

In response, computational epidemiologists Caitlin Rivers and Bryan Lewis have proposed guidelines for ethical research practices when using social media data, such as avoiding personally identifiable information and making all the results publicly available.

Whether or not Twitter and the scientists who have access to the company's data follow such guideposts, one thing is clear—the trend of mining social media data for various purposes won’t be tapering off anytime soon.

19 Reader Comments

In response, computational epidemiologists Caitlin Rivers and Bryan Lewis have proposed guidelines for ethical research practices when using social media data, such as avoiding personally identifiable information and making all the results publicly available.

Didn't the Netflix challenge prove once-and-for-all that there is no such thing as non-personally identifiable information?

Mining Twitter data makes sense and can be quite useful in many ways. Mining data from a site that isn't supposed to be public is altogether different. Twitter is a share-with-the-world platform. So, Twitter should make non-private data available. The research should prove interesting. @sajeffe on Twitter

Not necessarily. If you ask me, tweets sent while protected as well as any and all DMs would not be something that was meant to be shared with the world. This whole time there has been no mention of that, which is a bit disconcerting. Other than those points, I would totally agree.

What happens if Twitter is not the platform on which I choose to represent my happiness? Outside of that point, a study of this type assumes that you know what happiness is, which is completely subjective to the individual and their situation. For example, one may feel happy when they represent their ideological point of view and feel as though they are effecting some social change in society. On the other hand, happiness may be seen as a byproduct of agreement, where a class of individuals comes together in a common view, like "omg that cat is so cute." My question is, is this happiness? Or the representation of obscured unhappiness? When we strive to be ourselves, which may not appear to be generally happy, but by representing a subject that one is interested in, is that not happiness? As opposed to always being on the search for what may validate our feeling of acceptance through what our social structure sees as true happiness. So truly this complexity undermines this style of analysis, which leaves the study limited to those who use Twitter to share their personal life, and who believe their happiness is this representation.

So sure, people are happy, but let's go buy some shit on Twitter ads, because we are not happy enough.

What happens if Twitter is not the platform on which I choose to represent my happiness?

I think that if enough people are using Twitter to represent their happiness/joy then you can get statistically significant data from it even if there are many people who don't use it that way.

I believe you are missing my main point: you cannot pull statistical data on an undefined value, such as happiness. Is a smile happiness? What if one is smiling in a photo, they are from Seattle, but the photo was taken in Portland? That is not representative of the correct city, and one may ask whether the smile is even a representation of happiness or simply a response to a camera. It is fuzzy, clouded, and represents a people in much the same way the electoral college represents the whole of America.

You're touching on a couple of problems that plague social sciences in general. First, Twitter is a fairly big (although biased) sample of people, and the tweets are a fairly big (although again, biased) sample of their communications. Researchers either find clever ways to correct for the sampling biases using instrumental variables, Heckman procedures, zero-inflated distributions and other black magic... or they just note that the results apply to Twitter and let the reader judge its value.

On the second note, there are technical definitions for various facets of happiness.

Exactly! Even with various mathematical equations that attempt to quantify "some emotional concept" the true causation becomes complex to quantify. There are many avenues one can take such a study, but focusing purely on happiness is a very shallow venture.

Of course we are happy, we live in a developed nation, and we are active internet users. Just because we are happy with ourselves does not mean we are happy about everything in the world.

As a social science researcher myself (and one interested in getting this data), I feel compelled to step in here.

I think you are looking at this from the wrong angle -- the social sciences rarely put math first. Our primary goal is usually to test theory. In social sciences, it is exceedingly rare to have a perfectly clean way of testing theory, however. For archival empirical researchers, the difficulty is finding a reasonable proxy for the constructs at hand, often referred to as construct validity. Furthermore, we need to do our best to maintain internal validity -- attempting to control for other possible reasons for why our theory could play out in the data. The latter is one of the weaker aspects of archival research (but a strength of experimental methods), as there are always numerous other theories that need to be controlled for.

If a project is properly theory driven, your concerns about not representing happiness will quite possibly be irrelevant. Data should be chosen based on the appropriateness of the source -- if I want to study the general population's mood in relation to some set of events interacted with distance from the event*, Twitter is a pretty appropriate source, since it is a large sample, I can look back in time to when the events took place, and it is a pretty broad sample. It would require the usual caveats of "this may not apply beyond the scope of the data source," of course, unless there is a good reason to argue otherwise (for instance, a validation of the variable against an experimental study or the like). But even so, it would generalize to a large population anyhow. And it's probably the best data you'll get to measure the construct, for the time being.

*Not my actual interest in the data, just an example. Also, I just use happiness as the construct because that's what the prior posts talk about. It would be better to focus on a specific, theoretically driven operationalization of happiness.

All of these social media companies turn over information or acquiesce to just about any government request no matter how oppressive or unreasonable. Privacy policies and terms of use change whenever convenient for them and without regard for their users. Maybe I have delusions of grandeur but I eventually shut down all my social media accounts because I felt I was giving them, and by "them" I mean everybody with a subpoena or a nickel, something substantially more valuable than what they were giving me in return.

What happens if Twitter is not the platform on which I choose to represent my happiness?

I think that if enough people are using Twitter to represent their happiness/joy then you can get statistically significant data from it even if there are many people who don't use it that way.

I believe you are missing my main point: you cannot pull statistical data on an undefined value, such as happiness. Is a smile happiness? What if one is smiling in a photo, they are from Seattle, but the photo was taken in Portland? That is not representative of the correct city, and one may ask whether the smile is even a representation of happiness or simply a response to a camera. It is fuzzy, clouded, and represents a people in much the same way the electoral college represents the whole of America.

You're not listening to what the man said: Statistical data can determine what "happiness" is in the study based on the parameters selected FOR the study by those conducting the study.

Happiness is subjective, but those who are happy behave similarly, so they'll have an aggregate behavior that can be averaged. I doubt it will be tweeting pictures of kittens. They'll probably use things like "Still looking for a job" or "can't find a job" as indications that the person isn't happy, and "Just had a kid!" or "Finally got a job!" as indications that a person IS happy.

It's also likely they'll glean the data and compare it to places where "happiness index" studies have already been done to verify the results.

It doesn't take a rocket scientist to determine the overall happiness trend in people in general. Any idiot can see when people are happy, or unhappy, as a group, and can even gauge the level of it for the time it lasts (happiness is a transitional state between other moods, after all, and not a fixed state). To say that it isn't possible to establish the precise level of happiness in one individual in the study is probably true, but then, that's not what the study is trying to do.

You've obviously never been involved in studies - creating one or conducting one. Studies don't CARE about the individual because a single data point is irrelevant. Put a bunch of them together and you have a statistical universe from which conclusions (usually pretty accurate, and always with a margin of error) can be drawn.
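The keyword-and-aggregate idea in this comment can be made concrete with a small Python sketch. Everything here is an illustrative assumption rather than anything the grant projects are known to use: the cue phrases are borrowed from the examples above, the `classify` and `happiness_by_city` names are hypothetical, the sample tweets are invented, and the margin of error is the simple normal-approximation formula for a proportion.

```python
import math
from collections import defaultdict

# Hypothetical cue phrases, borrowed from the examples in the comment above.
HAPPY_CUES = ("just had a kid", "finally got a job")
UNHAPPY_CUES = ("still looking for a job", "can't find a job")


def classify(text):
    """Return 1 for a happy cue, 0 for an unhappy cue, None if neither matches."""
    lowered = text.lower()
    if any(cue in lowered for cue in HAPPY_CUES):
        return 1
    if any(cue in lowered for cue in UNHAPPY_CUES):
        return 0
    return None


def happiness_by_city(tweets, z=1.96):
    """Map each city to (share of happy tweets, margin of error, sample size).

    `tweets` is an iterable of (city, text) pairs. Tweets that match no cue
    are ignored. The margin uses the normal approximation z * sqrt(p(1-p)/n).
    """
    labels = defaultdict(list)
    for city, text in tweets:
        label = classify(text)
        if label is not None:
            labels[city].append(label)

    summary = {}
    for city, values in labels.items():
        n = len(values)
        p = sum(values) / n
        margin = z * math.sqrt(p * (1 - p) / n)
        summary[city] = (p, margin, n)
    return summary


# Invented sample data, for illustration only.
sample = [
    ("Seattle", "Finally got a job!"),
    ("Seattle", "Just had a kid!"),
    ("Seattle", "Still looking for a job"),
    ("Seattle", "Finally got a job, starting Monday"),
    ("Portland", "can't find a job"),
    ("Portland", "Lunch was fine"),
]
result = happiness_by_city(sample)
```

No single tweet matters much here; only the per-city share and its margin do. With samples this small the normal approximation is unreliable (a city with one labeled tweet gets a margin of zero), which is one reason a real study would need far more data and a validated measure.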

As for the article (to keep it topical instead of being all about your shortcomings in understanding how statistical data can be derived), my first take was "Oh, this is fine". Then I thought about WHAT they were going to do with the data and wondered what data they were going to give out. The second thought had more to do with privacy. As it turns out, the data is already out there, so it was like releasing old photos of someone who had already publicly posted a bunch of selfies in a shower. They wouldn't be doing that if they didn't already want to be seen, so no harm, no foul.

It'll be interesting to see what, if anything, comes of these studies. Since the data was public to begin with, I hope they release the results to the public as well.

I'm curious to know how you reached that conclusion, since the article states the opposite:

Quote:

whether mining tweets (many of which are not publicly accessible) for scientific research when Twitter users have not agreed to such uses is ethically sound.

In fact, one of the linked documents notes that "Researchers should adhere to a user’s attempt to control his or her data by respecting privacy settings." To me, this reads like Twitter will provide public/protected/private tweets in their data dump simply because they have it.

Many types of research require access to data that those under study probably don't want made public, at least not in any personally identifiable way. This is not in any way new, and universities have infrastructure in place to deal with the problem (independent review boards, reasonable attempts at IT security, etc.). At the simplest level, surveys are kept under lock-and-key and respondents quoted in the paper are given fake names.

It is possible to de-personalize data and have a second group run the analysis, but this doesn't really help in practice: one can simply search for a six-word phrase and figure out who said it. Therefore the answer is not to keep the researchers in the dark, rather it is to control access to the data and ensure that the researchers are responsible about what they disclose.
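To illustrate why stripping names offers little protection, here is a minimal Python sketch of the phrase-search idea described above. The function names, handles, and posts are all hypothetical and invented for this example; it assumes the attacker simply has a list of public (handle, text) pairs to search.

```python
def six_word_phrases(text):
    """Return the set of all runs of six consecutive words in `text`."""
    words = text.lower().split()
    return {" ".join(words[i:i + 6]) for i in range(len(words) - 5)}


def reidentify(anonymized_quote, public_posts):
    """Return handles whose public posts share a six-word phrase with the quote.

    `public_posts` is an iterable of (handle, text) pairs.
    """
    phrases = six_word_phrases(anonymized_quote)
    return sorted({
        handle
        for handle, text in public_posts
        if phrases & six_word_phrases(text)
    })


# Invented data: handles and posts are hypothetical.
public_posts = [
    ("@user_a", "I walk my dog by the river every single morning before work"),
    ("@user_b", "Traffic on the bridge was terrible again today"),
]
quote = "Respondent 17 said: my dog by the river every single morning"
matches = reidentify(quote, public_posts)
```

Here the "anonymous" quote matches exactly one account, because a six-word run is usually distinctive enough to act as a fingerprint. That is the commenter's reason for controlling who sees the data instead of relying on de-personalization.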

As for Twitter's terms of use, the researchers are probably considered "service providers."

Service Providers: We engage service providers to perform functions and provide services to us in the United States and abroad. We may share your private personal information with such service providers subject to confidentiality obligations consistent with this Privacy Policy, and on the condition that the third parties use your private personal data only on our behalf and pursuant to our instructions.

That doesn't address in any way the previous poster's claim that the "data was public to begin with," which is what I was asking.

The vast majority of what goes through Twitter is public, but not all of it.

The Library of Congress gets a feed of all public Twitter data, and has been quietly gathering suggestions from researchers on how best to organize that data for archival/research use.

So we are in agreement, then, that the data given to the researchers will include data that's supposed to be "private," yes? If that's the case, then there really is no difference between this and what Facebook does with their data - yet there's very little outcry here.

Yes, except that you don't seem to realize that researchers get access to private data on a regular basis. The project is proposed, screened by an independent review board, screened again by whoever owns the data (which can include a background investigation for government agencies), and audited throughout the project. At any given time your bank or dating app or ISP or e-commerce site or gaming console or who-knows-what-else is cooperating with researchers. In many cases (such as public health monitoring) there is no feasible way for an individual to opt out. The deal is that everyone involved in the project respects the confidentiality of that data when publishing results. Things run off the rails when private data is repurposed or when the collection was never properly screened in the first place (e.g., NSA snooping).

It's like discovering that your skin is crawling with microbes. Icky, but it's been that way for a very long time.