Here’s a common problem: Have you ever wanted to add up a very large list (hundreds of megabytes), or grep through it, or run some other operation that is embarrassingly parallel? Data scientists, I am talking to you. You probably have about four cores or more, but our tried and true tools like grep, bzip2, wc, awk, sed and so forth are single-threaded and will just use one CPU core. To paraphrase Cartman, “How do I reach these cores?” Let’s use all of the CPU cores on our Linux box with GNU Parallel, doing a little in-machine map-reduce magic with the little-known parameter --pipe (otherwise known as --spreadstdin). Your pleasure is proportional to the number of CPUs, I promise.

BZIP2

So, bzip2 is better compression than gzip, but it’s so slow! Put down the razor, we have the technology to solve this. Instead of this:
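(The exact commands here are a sketch; bigfile.bin is a placeholder for whatever you want to compress.)

cat bigfile.bin | bzip2 --best > bigfile.bz2

do this:

cat bigfile.bin | parallel --pipe --recend '' -k bzip2 --best > bigfile.bz2

The --recend '' option tells parallel to cut the input purely by block size rather than on newlines (we are compressing arbitrary bytes, not lines), and -k keeps the compressed chunks in their original order, so the concatenated bzip2 streams decompress back to the original data.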

Especially with bzip2, GNU parallel is dramatically faster on multi-core machines. Give it a whirl and you will be sold.

GREP

If you have an enormous text file, rather than this:

grep pattern bigfile.txt

do this:

cat bigfile.txt | parallel --pipe grep 'pattern'

or this:

cat bigfile.txt | parallel --block 10M --pipe grep 'pattern'

The second command shows the use of --block with 10 MB blocks of data from your file; you might play with this parameter to find out how much input you want to hand each CPU core. I gave a previous example of how to use grep with a large number of files, rather than just a single large file.

AWK

Here’s an example of using awk to add up the numbers in a very large file. Rather than this:
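(Again a sketch, assuming bigfile.txt holds one number per line.)

cat bigfile.txt | awk '{s+=$1} END {print s}'

do this:

cat bigfile.txt | parallel --pipe awk \'{s+=\$1} END {print s}\' | awk '{s+=$1} END {print s}'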

This one is more involved: the --pipe option in parallel spreads the input out into chunks for multiple awk calls, giving a bunch of sub-totals. Those sub-totals go into the second pipe with the identical awk call, which gives the final total. The first awk call has three backslashes in it due to the need to escape the awk call for GNU parallel.

WC

Want to create a super-parallel count of the lines in a file? Instead of this:

wc -l bigfile.txt

Do this:

cat bigfile.txt | parallel --pipe wc -l | awk '{s+=$1} END {print s}'

This is pretty neat: what is happening here is that during the parallel call, we are ‘mapping’ a bunch of calls to wc -l, generating sub-totals, and finally adding them up with the final pipe pointing to awk.

SED

Feel like using sed to do a huge number of replacements in a huge file? Instead of this:
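(Another sketch; old, new and the file names are placeholders.)

sed s^old^new^g bigfile.txt > bigfile_new.txt

do this:

cat bigfile.txt | parallel -k --pipe sed s^old^new^g > bigfile_new.txt

The -k flag keeps the rewritten chunks in their original order, so the output file ends up with its lines in the same order as the input.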

Thank you for the thought – but from my testing, this use case works well. If you use the grep -A (after) or -B (before) features, it probably will break the semantics when a match sits on a block boundary. But for straightforward grep filtering this should work, since by default GNU Parallel cuts records on ‘\n’ newlines. Do you have an example where grep gets broken?

Parallel is awesome! Just a note for Ubuntu folks: if you don’t have parallel installed, the system will recommend the package moreutils. However, that package ships a different tool with different syntax, and it will seem to fail silently. You should look for the package named parallel when installing.
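On a recent Ubuntu that means something like:

sudo apt-get install parallel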

I tried this with bzip2, and was impressed with the results – until I went to decompress the file. The file is damaged, and will not uncompress. I would test this if you are going to use it. The line I used was:

I don’t think it’s valid to create a single file from the parallel bzips. Bzip uses Huffman encoding. Let’s say byte a comes before byte b in the compressed file. Byte b will depend on a. When you run the compression in parallel like that, you can no longer guarantee that property. As a result, decompressing the file with a different number of splits or with different split locations will fail.

The only way this could work is if each bzip writes its own header and end-of-stream footer for each split, and bzip understands what to do when it sees multiple bzip headers and footers.

The valid way to do this would be to have separate output files instead of a single one.

Thanks! It appears my double hyphens (–) are somehow being converted to single hyphens or em-dashes or something in YOUR rendering. The command I provided does work – I just verified it. I don’t know why that weird double-to-single hyphen conversion happened for you; it is not happening in Google Chrome for me. It could have to do with a code-highlighting plug-in I am using…

Secondly – your point on Huffman encoding could stand, but at least with the implementation of bzip2 on Ubuntu 12.04 (bzip2 1.0.6), I have generated random data up to 1 GB in size, performed the parallel compression I describe, then decompressed it and compared it with the original. It all works fine. If you could generate an example that fails I would appreciate it, since I don’t know about the internals of Huffman encoding! Your analysis may hold in principle, but my experiments show that the method works for large files.
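For the curious, the test was roughly along these lines (the sizes and file names are placeholders):

head -c 1G /dev/urandom > bigfile.bin
cat bigfile.bin | parallel --pipe --recend '' -k bzip2 --best > bigfile.bz2
bunzip2 -c bigfile.bz2 | cmp - bigfile.bin && echo OK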

Seconded that it works for me, but I need to test on files larger than 12 MB.

However, due to the parallel nature, it should be noted that the encoding will not be as efficient, and the compressed files will be slightly larger than they would be if bzip2 were used by itself. Probably not a world of difference, but I would be interested to see how different the sizes would be for a 1 GB file. The difference in compression of a 12 MB file is ~50 KB.

So now one problem to consider is that the compression is local to the parallel chunks. It’s not compressing the entire file as a whole, but instead each chunk of the file on its own. This can result in larger files.

Note: This article only makes sense if you have an SSD. If you have a traditional rotating-platter disk, the disk I/O time so dominates the runtime that having an effectively faster CPU just can’t make any difference.

This article is trying to demonstrate some basics of using multiple cores for various tasks – so take the lessons and build on them. I do use SSDs, so your point makes sense – but depending on the type of computation you do, it may be much more computationally expensive than a stupid little ‘wc’, of course.