Python: Multiprocessing large files

I have been working with a lot of very large files, and it has become increasingly obvious that using a single processor core is a major bottleneck to getting my data processed in a timely fashion. A MapReduce-style algorithm seemed like the way to go, but I had a hard time finding a useful example. After a bit of hacking about I came up with the following code.

The basic algorithmic idea is to first read in a large chunk of lines from the file. These are then partitioned out to the available cores and processed independently. The new set of lines is then written to an output file or, in this example, just printed to the screen. Normally this would be tricky code to write, but Python 2.7’s wonderful multiprocessing module handles all the synchronization for you.
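As a rough illustration of that read, partition, process, write loop, here is a minimal sketch. It is not the original script: it assumes Python 3 (for the `with` form of `multiprocessing.Pool`), and `process_line`, `read_chunks`, `process_file`, `main` and the default chunk size of 10000 lines are hypothetical names and values chosen for the example.

```python
import itertools
import multiprocessing
import sys


def process_line(line):
    # Hypothetical per-line work; swap in the real transformation.
    return line.upper()


def read_chunks(handle, chunk_size):
    """Yield successive lists of up to chunk_size lines from an open file."""
    while True:
        chunk = list(itertools.islice(handle, chunk_size))
        if not chunk:
            return
        yield chunk


def process_file(handle, out, pool, chunk_size=10000):
    """Process the file one chunk at a time, writing each chunk as it finishes."""
    for chunk in read_chunks(handle, chunk_size):
        # pool.map hands the lines to the worker processes and returns
        # the results in the same order as the input lines.
        results = pool.map(process_line, chunk)
        # Write before the next iteration rebinds `results`, so only one
        # chunk of output is held in memory at a time.
        out.writelines(results)


def main(path):
    # Pool() with no argument starts one worker per available core.
    with multiprocessing.Pool() as pool, open(path) as handle:
        process_file(handle, sys.stdout, pool)
```

To try it, call `main("some_file.txt")` from under an `if __name__ == "__main__":` guard, which the multiprocessing module needs on platforms that start workers by spawning a fresh interpreter. The chunk size is the knob to tune: larger chunks mean less overhead per call to `pool.map`, at the cost of more memory.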

The results should be overwritten on each loop, but only after you write them to an outfile. If I extended the results list as you suggest, all the results would be stored in memory, which kind of defeats the purpose of this script. Overwriting results isn’t a bug, it’s a feature! 😉