hadoop21@beta:~/wikihadoop/diffs$ /usr/lib/hadoop-beta/bin/hdfs dfs -cat /usr/hadoop/out-10-bzip2/part-00000 | head -n 3
133350337 11406585 0 'National security and homeland security presidential directive' 1180070193 u'Begin' False 308437 u'Badagnani' 0:1:u"The '''[[National Security and Homeland Security Presidential Directive]]''' (NSPD-51/HSPD-20), signed by President [[George W. Bush]] on May 9, 2007, is a [[Presidential Directive]] giving the [[President of the United States]] near-total control over the United States in the event of a catastrophic event, without the oversight of [[United States Congress|Congress]].\n\nThe signing of this Directive was generally unnoticed by the U.S. media as well as the U.S. Congress. It is unclear how the National Security and Homeland Security Presidential Directive will reconcile with the [[National Emergencies Act]], signed in 1976, which gives Congress oversight during such emergencies.\n\n==External links==\n*[http://www.whitehouse.gov/news/releases/2007/05/20070509-12.html National Security and Homeland Security Presidential Directive], from White House site\n\n==See also==\n*[[National Emergencies Act]]\n*[[George W. Bush]]\n\n{{US-stub}}"
133350707 11406585 0 'National security and homeland security presidential directive' 1180070344 None False 308437 u'Badagnani' 906:1:u'National Security Directive]]\n*[['
133350794 11406585 0 'National security and homeland security presidential directive' 1180070386 None False 308437 u'Badagnani' 613:-1:u'signed' 613:1:u'a U.S. federal law passed'

Each row represents a revision from the April 2011 XML dump of the English Wikipedia. There *should* be a row for every revision that had not been deleted when that dump was produced; however, at this time some cleanup is needed to remove duplicates and fill in missing revision diffs.

rev_id: The identifier of the revision being described (PRIMARY KEY)

page_id: The identifier of the page being revised

namespace: The identifier of the namespace of the page

title: The title of the page being revised

timestamp: The time the revision took place, as a Unix epoch timestamp in seconds
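Assuming the rows are tab-separated in the documented column order, individual fields can be pulled out with cut. The sample file below is a hypothetical local excerpt standing in for the HDFS part files shown above:

```shell
# Hypothetical local excerpt of the diff rows (tab-separated, in the
# documented column order); stands in for the HDFS part files.
printf '133350337\t11406585\t0\tExample title\t1180070193\n' > sample.tsv

# Extract rev_id (column 1), page_id (column 2) and timestamp (column 5).
cut -f1,2,5 sample.tsv
```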

With 3 nodes (24 cores in total), one dump file of the English Wikipedia takes approximately 20-24 hours to process.

If you want to extract the dataset as an ordinary file, accumulate the dataset rows into one file (e.g. diffs.tsv) using /usr/lib/hadoop-beta/bin/hdfs dfs -cat /usr/hadoop/out-*/part-* > diffs.tsv.
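If you want the compressed diffs.tsv.gz instead, one option (an assumption, not something the source prescribes) is to pipe the concatenated stream through gzip on the fly; the HDFS paths are those used elsewhere on this page:

```shell
# Concatenate all part files and compress on the fly into diffs.tsv.gz.
/usr/lib/hadoop-beta/bin/hdfs dfs -cat /usr/hadoop/out-*/part-* | gzip > diffs.tsv.gz

# Check the archive's integrity afterwards.
gzip -t diffs.tsv.gz
```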

There are some duplicates in the results [16]. To exclude them, use /usr/lib/hadoop-beta/bin/hdfs dfs -cat /usr/hadoop/out-*/part-* | sort -n -k2 -k1 -u -T ~/tmp/ > diffs.tsv instead. Note that ~/tmp needs to be on a filesystem large enough to hold all the results; check the total size with /usr/lib/hadoop-beta/bin/hdfs dfs -du /usr/hadoop/out-*/part-*.
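The effect of the -u flag can be checked on a small local sample (hypothetical rows, with only the first three columns, standing in for the real output):

```shell
# Hypothetical three-row sample: the first two rows share the same
# rev_id and page_id, simulating a duplicate.
printf '5\t1\tfirst\n5\t1\tsecond\n6\t1\tthird\n' > dup.tsv

# Same sort invocation as above: numeric sort on the keys starting at
# fields 2 and 1, keeping only one row per equal key.
sort -n -k2 -k1 -u dup.tsv
```

Of the three input rows, only two survive: the duplicate pair collapses to a single row.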

This may take from several hours to a day, depending on the size. The result will be more than 400 GB for the English Wikipedia.