For this sprint, I produced a Python package for efficiently processing Wikipedia dumps. The software maps a function over the pages in a set of XML database dumps. It is:

- easy to work with, because the interface is an iterator over streaming page data that can be looped over, and

- light on memory, because it takes advantage of the efficiency of stream-reading the XML with a SAX-style parser.
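To illustrate the streaming idea, here is a minimal sketch (not the package's actual API) using the standard library's `iterparse` as a stand-in for SAX-style stream reading; `iter_pages` and the tiny in-memory dump are hypothetical names invented for this example:

```python
import io
import xml.etree.ElementTree as ET

def iter_pages(xml_file):
    """Yield (title, text) pairs as each <page> element completes."""
    for _, elem in ET.iterparse(xml_file, events=("end",)):
        if elem.tag.endswith("page"):
            yield elem.findtext("title"), elem.findtext("revision/text")
            elem.clear()  # free the memory held by the finished element

# Tiny in-memory "dump" standing in for a real multi-gigabyte file.
dump = io.StringIO(
    "<mediawiki>"
    "<page><title>A</title><revision><text>foo</text></revision></page>"
    "<page><title>B</title><revision><text>bar</text></revision></page>"
    "</mediawiki>"
)
pages = list(iter_pages(dump))
print(pages)  # [('A', 'foo'), ('B', 'bar')]
```

Because each `<page>` element is discarded as soon as it has been yielded, memory use stays flat no matter how large the dump file is.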

Since Python uses a Global Interpreter Lock, threading cannot take advantage of multiple cores on a machine. To circumvent this problem, the multiprocessing package mimics the threading interface using forked subprocesses. Through this interface, primitive thread-safety mechanisms (locks, queues, etc.) can be used to pass messages between the processes.

This package creates a "Processor" for each available core on the client machine and publishes a queue of dump files for the processors to consume. Each processor's output is then serialized through a central output queue into a generator that the main process can iterate over.

The resulting system can be passed a function for processing page-level data (and its revisions).