Massive Scrape of Twitter’s Friend Graph

We’ve taken the data down for the moment, at Twitter’s request. STAY CALM. They want to support research on the twitter graph, but feel that since this is users’ data there should be terms of use in place. We’ve taken the data down while those terms are formulated. I pass along from @ev: “Thank you for your patience and cooperation.”

The infochimps have gathered a massive scrape of the Twitter friend graph. Right now it weighs in at:

- about 2.7M users: we have most of the “giant component”
- 10M tweets
- 58M edges

(These and other details will be updated as further drafts are released. See below for technical info). This is still in rough, rough draft but this dataset is so amazingly rich we couldn’t help sharing it. We have not done all the double-checking we’d like, and the field order will change in the next (12/30) rev. We’ll also have a much larger dump of tweets off the public datamining feed.

The data is offline at the moment pending some TOS from twitter.com. If you’re interested in hearing when it’s released, follow the low-traffic @infochimps on twitter or look for a post here.

Big huge thanks to twitter.com: they have given us permission to share this freely. Please go build tools with this data that make both twitter.com and yourself rich and famous: then more corporations will free their data.

THE FILES ARE HUGE. They will, in principle, work with anything that can import a spreadsheet-style TSV file. But if you try to load a 56-million row dataset into Excel it will burst into flames. So will most tools; even opening an explorer/finder window on the ripd/_xxxx directories will fill you with regret.

If you have access to a cluster with Hadoop and Pig, that’s highly, highly recommended. Otherwise, the files will load straight into MySQL using LOAD DATA INFILE (and presumably into other DBs as well). Industrial-strength products such as Mathematica and Cytoscape will struggle, but can handle good-sized subsets. And don’t worry: besides featuring it on infochimps.org when we launch, once this dataset is mature we’ll move the raw data onto Amazon Public Datasets and archive.org. (At which point Amazon can serve as your Hadoop cluster, if you’ve got the lucre.)
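If you don’t have a database or a cluster handy, streaming the file is the safe route. Here’s a minimal Python sketch of that approach — the two-column edge layout (follower, followed) and the file contents are hypothetical stand-ins, since the actual field order is still in flux:

```python
import csv
import io
from collections import Counter

def degree_counts(tsv_stream):
    """Stream a two-column follower TSV (user_a follows user_b) and
    tally out-degree per user, without loading the file into memory."""
    out_degree = Counter()
    reader = csv.reader(tsv_stream, delimiter="\t")
    for row in reader:
        if len(row) >= 2:
            out_degree[row[0]] += 1
    return out_degree

# Toy sample standing in for the real 58M-edge file:
sample = io.StringIO("alice\tbob\nalice\tcarol\nbob\tcarol\n")
print(degree_counts(sample))  # Counter({'alice': 2, 'bob': 1})
```

The same loop works unchanged on the full dump: `csv.reader` over an open file handle never holds more than one row in memory, which is exactly why Excel chokes and this doesn’t.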

Description of objects and fields:

All the files are tab-separated values (TSV) files.

Users:

Partial Users: 8.1 million sightings of 2.7M unique users. When you ask for a user’s followers / friends list, or see a user in the public-timeline tweets, you get a partial listing for each user. This table lists each unique state observed: if @infochimps was seen on the 10th, the 15th, and the 16th with respectively 80, 80, and 82 followers (everything else the same), you’ll get the twitter_user_partial records of the 10th and the 16th.

Fields: [:id], id, :screen_name, :followers_count, :protected, :name, :url, :location, :description, :profile_image_url
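The “each unique state observed” rule above can be sketched in a few lines of Python — the three-column rows below (id, screen_name, followers_count) are a simplified, hypothetical slice of the real record layout:

```python
def unique_states(rows):
    """Keep a sighting only when it differs from that user's previously
    observed state -- drops consecutive duplicate observations."""
    last_seen = {}  # user id -> last observed state tuple
    kept = []
    for row in rows:
        uid, state = row[0], tuple(row)
        if last_seen.get(uid) != state:
            kept.append(row)
            last_seen[uid] = state
    return kept

sightings = [
    ["42", "infochimps", "80"],  # the 10th
    ["42", "infochimps", "80"],  # the 15th -- identical, dropped
    ["42", "infochimps", "82"],  # the 16th -- changed, kept
]
print(unique_states(sightings))  # keeps the 1st and 3rd rows
```

So the three sightings in the example collapse to two records, exactly as described.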

Random Notes:

There may be inconsistent user data for all-numeric screen_names: see http://code.google.com/p/twitter-api/issues/detail?id=162
That is, the data in this scrape may commingle information on the user having screen_name ‘415’ with that of the user having id #415. Not much we can do about it, but we plan to scrub that data later.
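If you want to sidestep the ambiguity in the meantime, flagging the suspect rows is trivial. A sketch, assuming (hypothetically) that screen_name is the second column of each row:

```python
def ambiguous_rows(rows):
    """Flag partial-user rows whose screen_name is all digits -- these
    may have been commingled with the user whose numeric id matches."""
    return [r for r in rows if r[1].isdigit()]

rows = [
    ["7", "415", "12"],   # all-numeric screen_name: suspect
    ["99", "bob", "5"],   # fine
]
print(ambiguous_rows(rows))  # [['7', '415', '12']]
```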