Thursday, 10 November 2011

Today I took a FASTQ file with 3.5M reads, which was Read 1 from a paired-end Illumina 100 bp run. It was about 883 MB in size. As many have shown before me, GZIP compresses it to about 1/4 of the original size, and BZIP2 to about 1/5. All sizes below are in KB.

883252 R1.fastq

233296 R1.fastq.gz

182056 R1.fastq.bz2

I then split the read file into 3 separate files: (1) the ID line, but with the mandatory '@' removed, (2) the sequence line, but uppercased for consistency, and (3) the quality line, unchanged. I ignored the 3rd line of each FASTQ entry (the '+' line), as it is redundant. This knocked about 1% off the total size.

189588 id.txt

341756 seq.txt

341756 qual.txt

873100 TOTAL
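
The split described above can be sketched in a few lines of Python. This is only an illustration, not the script I actually used: the function name `split_fastq` and the example reads are made up, and it assumes strict 4-line FASTQ records with no line wrapping. Reading by line position rather than looking for '@' matters, because a quality line can itself start with '@'.

```python
import io

def split_fastq(handle):
    """Split 4-line FASTQ records into three parallel lists.

    Returns (ids, seqs, quals): IDs without the leading '@',
    uppercased sequences, and quality strings left unchanged.
    Assumes unwrapped records (exactly 4 lines each).
    """
    ids, seqs, quals = [], [], []
    while True:
        header = handle.readline()
        if not header:            # end of file
            break
        if not header.strip():    # tolerate stray blank lines
            continue
        seq = handle.readline()
        handle.readline()         # the '+' line is redundant, so drop it
        qual = handle.readline()
        ids.append(header.rstrip("\n")[1:])   # strip the mandatory '@'
        seqs.append(seq.rstrip("\n").upper())
        quals.append(qual.rstrip("\n"))
    return ids, seqs, quals

# Two made-up reads; the second quality line starts with '@' on purpose.
example = io.StringIO(
    "@read1/1\nacgtn\n+\nIIIH#\n"
    "@read2/1\nGGCAT\n+read2/1\n@@@II\n"
)
ids, seqs, quals = split_fastq(example)
print(ids, seqs, quals)
```

Writing each list to its own file, one entry per line, gives the equivalent of id.txt, seq.txt and qual.txt.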

Next, I compressed each of the three streams (ID, sequence, quality) independently with GZIP. The idea is that a dictionary-based compression scheme like this will work better on homogeneous data streams than on the same data interleaved in one stream. As you can see, this does improve things by about 11%, but the result is still not as good as BZIP2 without de-interleaving.

20608 id.txt.gz

84096 qual.txt.gz

102040 seq.txt.gz

206644 TOTAL (was 233296 combined)
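
For anyone who wants to try the same comparison without the command-line tools, here is a rough sketch using Python's standard `gzip` and `bz2` modules. The helper names `make_reads` and `compressed_sizes` are hypothetical, and the reads are random synthetic data, so the sizes it prints will not match the ratios seen on real Illumina reads. Note also that `gzip.compress` and `bz2.compress` default to level 9, which may differ from the command-line defaults.

```python
import bz2
import gzip
import random

def make_reads(n, length=100, seed=0):
    """Generate n synthetic reads (made-up IDs, random bases and qualities)."""
    rng = random.Random(seed)
    ids, seqs, quals = [], [], []
    for i in range(n):
        ids.append(f"RUN1:1:{i // 1000}:{i % 1000}/1")
        seqs.append("".join(rng.choice("ACGT") for _ in range(length)))
        quals.append("".join(chr(33 + rng.randint(20, 40)) for _ in range(length)))
    return ids, seqs, quals

def compressed_sizes(ids, seqs, quals, compress):
    """Return (interleaved_size, separate_total) in bytes for one compressor."""
    # One stream: ordinary 4-line FASTQ records.
    interleaved = "".join(
        f"@{i}\n{s}\n+\n{q}\n" for i, s, q in zip(ids, seqs, quals)
    ).encode()
    # Three streams: IDs, sequences and qualities, one entry per line.
    streams = [("\n".join(column) + "\n").encode() for column in (ids, seqs, quals)]
    return len(compress(interleaved)), sum(len(compress(s)) for s in streams)

ids, seqs, quals = make_reads(2000)
for name, compress in (("gzip", gzip.compress), ("bzip2", bz2.compress)):
    single, split = compressed_sizes(ids, seqs, quals, compress)
    print(f"{name}: interleaved={single} bytes, separate={split} bytes")
```

To run it on real data, replace `make_reads` with the three lists read from an actual FASTQ file.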

If we use BZIP2 to compress the three de-interleaved streams, it does only about 3% better than when it compressed the original single stream. This is testament to BZIP2's ability to cope with heterogeneous data streams better than GZIP.

16560 id.txt.bz2

66812 qual.txt.bz2

93564 seq.txt.bz2

176936 TOTAL (was 182056 combined)
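
The percentage savings from de-interleaving can be recomputed directly from the totals listed above. The helper name `saving` is just for illustration; the numbers are the reported KB totals.

```python
def saving(before_kb, after_kb):
    """Percentage reduction in size going from before_kb to after_kb."""
    return 100.0 * (before_kb - after_kb) / before_kb

# Combined (interleaved) size vs. sum of the three separate streams.
gzip_saving = saving(233296, 206644)
bzip2_saving = saving(182056, 176936)
print(f"GZIP:  {gzip_saving:.1f}% smaller when de-interleaved")
print(f"BZIP2: {bzip2_saving:.1f}% smaller when de-interleaved")
```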

So in summary, we've re-learnt that BZIP2 compresses better than GZIP, and that both do quite well at adapting to the three interleaved data types in a FASTQ file.