Parsing Wikipedia Articles: Wikipedia Extractor and Cloud9

Lately I have been doing a lot of work with the Wikipedia XML dump as a corpus. Wikipedia provides a wealth of information to researchers in easy-to-access formats, including XML, SQL and HTML dumps for all language properties. Some of the data freely available from the Wikimedia Foundation include

As Wikipedia readers will notice, the articles are very well formatted, and this formatting is generated by a somewhat unusual markup format defined by the MediaWiki project. As Dirk Riehle stated:

There was no grammar, no defined processing rules, and no defined output like a DOM tree based on a well defined document object model. This is to say, the content of Wikipedia is stored in a format that is not an open standard. The format is defined by 5000 lines of php code (the parse function of MediaWiki). That code may be open source, but it is incomprehensible to most. That’s why there are 30+ failed attempts at writing alternative parsers.

For example, below is an excerpt of the wiki syntax for a page on data mining.

I was terribly worried that I would spend weeks writing my own parser and never complete the project I am working on at work. To my surprise, I found a fairly good parser. Since I am working on named entity extraction and n-gram extraction, I wanted to extract only the plain text. If we take the above junk and extract only the plain text, we get

Data mining (the analysis step of the knowledge discovery in databases process, or KDD), a relatively young
and interdisciplinary field of computer science is the process of discovering new patterns from large data sets
involving methods at the intersection of artificial intelligence, machine learning, statistics and database systems.
The goal of data mining is to extract knowledge from a data set in a human-understandable structure and involves
database and data management, data preprocessing, model and inference considerations, interestingness
metrics, complexity considerations, post-processing of found structure, visualization and online updating.

and from this we can remove punctuation (except the sentence terminators .?!), convert to lower case and perform other text mining pre-processing steps. There are many, many Wikipedia parsers of various qualities. Some do not work at all, some work only on certain articles, some have been abandoned as incomplete and some are slow as molasses.
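As a sketch of those pre-processing steps and of simple n-gram extraction (the regular expression and the function names here are my own illustration, not code from any of the parsers discussed):

```python
import re

def preprocess(text):
    """Lower-case text and strip punctuation, keeping only word
    characters, whitespace and the sentence terminators . ? !"""
    text = text.lower()
    # drop every character that is not a word character, whitespace,
    # or one of the sentence terminators
    return re.sub(r"[^\w\s.?!]", "", text)

def ngrams(tokens, n):
    """Return all n-grams (as tuples) over a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

clean = preprocess("Data mining (the analysis step of KDD), a young field!")
# clean == "data mining the analysis step of kdd a young field!"
bigrams = ngrams(clean.split(), 2)
```

From here the tokens can be fed into whatever named entity or n-gram pipeline you are running downstream.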

Wikipedia Extractor is not perfect, but it is one of the best parsers I have seen. For some reason, wikilinks are converted to HTML links; correcting this required modifying the source code.

Retooling the package to work with Hadoop Streaming is not too difficult, but it requires some work and digging into the internals that should be easier.
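For illustration, a minimal Hadoop Streaming mapper might look like the following. This is a hypothetical sketch, not code from Wikipedia Extractor: it assumes the extracted plain text arrives one chunk per line on stdin, and `map_line` is my own helper name.

```python
#!/usr/bin/env python
"""Minimal Hadoop Streaming mapper sketch: emit (token, 1) pairs."""
import sys

def map_line(line):
    # tokenize on whitespace and lower-case each token
    return [(token.lower(), 1) for token in line.split()]

def main():
    for line in sys.stdin:
        for token, count in map_line(line):
            # Hadoop Streaming expects tab-separated key/value lines
            sys.stdout.write("%s\t%d\n" % (token, count))

if __name__ == "__main__":
    main()
```

A reducer on the other side would then sum counts per token; the point is that any existing Python text-processing code can slot into `map_line`.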

Wikipedia Extractor is good for offline analysis, but users will probably want something that runs faster. Wikipedia Extractor parsed the entire Wikipedia dump in approximately 13 hours on one core, which is quite painful. Add in further parsing and the processing time becomes unbearable even on multiple cores. A Hadoop Streaming job using Wikipedia Extractor, combined with too much file I/O between Elastic MapReduce and S3, required 10 hours to complete on 15 c1.medium instances.

Ken Weiner (@kweiner) recently re-introduced me to the Cloud9 package by Jimmy Lin (@lintool) of Twitter, which fills in some of these gaps. I avoided it at first because Java is not the first language I like to turn to; Cloud9 is written in Java and designed with Hadoop MapReduce in mind. There is a method within the package that explicitly extracts the body text of each Wikipedia article; it calls the Bliki Wikipedia parsing library. One common problem with these Wikipedia parsers is that they often leave wiki syntax in the output, but Jimmy wraps Bliki with his own code to do a better job of extracting high-quality, text-only output. Cloud9 also has counters and functions that detect non-article content such as redirects, disambiguation pages, and more.

Developers can introduce their own analysis, text mining and NLP code to process the article text in the mapper or reducer code. An example job distributed with Cloud9, which simply counts the number of pages in the corpus, took approximately 15 minutes to run on 8 cores on an EC2 instance. A job that did more substantial processing required 3 hours to complete, and once the corpus was refactored as sequence files, the same job took approximately 90 minutes to run.

Conclusion

I am looking forward to playing with Cloud9 some more… I will take 90 minutes over 10 hours any day! Wikipedia Extractor is an impressive Python package that does a very good job of extracting plain text from Wikipedia articles, and for that I am grateful. Unfortunately, it is far too slow to be used on a pay-per-use system such as AWS or for quick processing. Cloud9 is a Java package designed with scalability and MapReduce in mind, allowing much quicker and more wallet-friendly processing.

You might want to check out Google/Freebase’s weekly WEX dumps. They’ve done a bunch of the grunt work and publish the results on a regular basis. In the past they’ve made them available on EC2, which would save you the bandwidth charges, although I’m not sure they still do that on a regular basis.

This is very cool indeed! One way to speed this up significantly is to use Wikihadoop with your existing Python code for the mapper. Wikihadoop, in contrast to the Cloud9 package, is able to stream the full bzip2-compressed XML dump files using Hadoop Streaming. I am happy to help you if you are stuck. You can find Wikihadoop at: https://github.com/whym/wikihadoop

Thanks! We tried WikiHadoop, but it did not seem very generalizable. The authors seemed familiar only with using it for diffing revisions. It could be an extremely powerful project if the documentation were better and if it were not restricted to Hadoop 0.21+.

Awesome! I will send you an email when I get a chance. WikiHadoop has the potential to be extremely useful. Cloud9 was great, but since Java is not my preferred language, it was a pain to set up at first.

I’ve struggled a bit trying to get data from Wikipedia. For example, I’d love to get a plain-text data dump of all the wikis, edit histories and discussion pages on S&P 500 companies, or of all the wikis under “Companies founded in year XYZ”.

It would be interesting to see whether older companies have longer wikis, to do some kind of sentiment analysis on S&P 500 companies, or to see how frequently changes are made and of what byte size.

Any tips on how I’d get started trying to get the data for this? I’m technical enough to know PHP, but I don’t know – this might be beyond me.
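One lightweight way to start, without parsing the full dump, is to query the public MediaWiki API for the members of a category. The sketch below only builds the documented `list=categorymembers` query URL (the function name is my own; fetching the JSON and paging through results with `cmcontinue` is left out):

```python
# Build a MediaWiki API query URL that lists the pages in a category.
# Parameters follow the documented list=categorymembers API.
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode        # Python 2

API = "https://en.wikipedia.org/w/api.php"

def categorymembers_url(category, limit=500):
    """Return the API URL for listing members of Category:<category>."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": "Category:" + category,
        "cmlimit": limit,
        "format": "json",
    }
    return API + "?" + urlencode(params)

url = categorymembers_url("Companies established in 1892")
```

Feeding the resulting page titles back into the API (or into a dump parser) then gets you article text and revision histories one page at a time.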

After attempting to parse the Wikipedia dump myself, I ended up experimenting with DBpedia data (http://wiki.dbpedia.org/Downloads37) instead. The DBpedia data includes (after cleanup) Wikipedia article titles, abstracts, categories, redirects and disambiguations, which might be enough for my use.

I am just wondering why DBpedia did not extract the full-text article content, but only the abstract.

As I am still only halfway through playing with the DBpedia data, I cannot yet say whether it has enough information for me.

Expecting to see more efforts in this space to make Wikipedia data more accessible for programmers, especially Python geeks.

Not sure how it compares, but a while back I wrote some tokenizers/token filters for Lucene that work on Wikipedia. They aren’t perfect, but if you know Lucene, it may not be too hard to extend them for your needs. Naturally, you can then feed them into Lucene’s n-gram capabilities and other filters to build up what you need.