A computational biologist's personal views on new technologies & publications on genomics & proteomics and their impact on drug discovery

Tuesday, December 18, 2012

The Trouble with FASTQ

I spend a lot of time working with sequencing data, and the most common format for such data is FASTQ. FASTQ has many things to appreciate, but FASTQ data also can be troublesome

For those who don't spend their days with such data, a bit of a historical backdrop. When I first was exposed to bioinformatics in the early 90s, there were many formats for sequence data. Pretty much every sequence analysis suite and public database had its own distinct format. The vast majority of these were designed for human readability first and computer parsing second, and tended to have text organized in columns by the use of carefully allotted spaces. Creating files in such formats was a dicey thing, because formal specifications of the formats were not common; you generally just tried to create something that looked like what you had seen and tried it out. That worked often, but failures were rampant. One of my first graduate bioinformatics projects involved a large-scale parse of features from GenBank, and I ended up spending a lot of time detecting and curating various inconsistencies.

FASTA format was a very different beast, strongly going for simplicity. Anyone could create a file in FASTA format, as the format was simply a line with a '>' followed by the sequence identifier and optionally a description, and then the sequence on the remaining lines.

>sequenceId Description goes here

ATCGATAGAGGGAGAGATAGATCCCCATGAGACCCGG

ATCAGAGGGGGATTTCCCTGGGGTGGGGGTTT

Again, formal specifications did not exist, so various programs would complain about variations within that theme. Some programs would tolerate a space between the angle bracket and the identifier; others would not. BLAST's database preparation program insisted that the sequence lines (other than the last for each sequence) have the same number of characters; other programs didn't care.

FASTA became the dominant open format for sequences, perhaps driven by BLAST using it, and perhaps by the simple fact that anyone with a text editor could create a file. When the idea of assigning quality values to DNA sequences arose (from phred, I believe), then the format was borrowed and adjusted to fit

>sequenceId Description goes here

10 20 12 13 14 18 17

Again, the trend for human readability, though it was even less likely any human would read this than the sequences. Most commonly, each sequencer trace would be converted to one FASTA file and one quality file. If a lot of files were a nuisance, then all the sequence files and all the quality files could each be concatenated into a pair of files, with care taken that the order of sequence ids was the same in each.

FASTQ was a clever leap forward, driven largely by the much larger volumes of next-gen data. FASTQ combines the two files into a single file, with alternating sequence and quality info. The really clever change was to eliminate the human readability of the quality values and instead go for a compact storage by converting each quality value to a character by a simple formula. Alas, two different encodings have been used, though only one really persists.

@HWI-ST1258:73:C0VWTACXX:8:1101:10726:2074

GTAATATGTGCTCTTGATCATATGTACATAAACTAA

+

#1=DDFFFHFGHHJJJJJJJFAFHIGICHIGGHHEF

Okay, so far this all is good. So what is the trouble with FASTQ? The trouble is the files have a tendency to multiple like tribbles!

Take a typical Illumina run. As I get the data back from my sequence provider, I have two files per sample, one for each read (for really big runs each set is split into multiple files). Many times, these have been compressed with either zip, gzip or bzip. On the UNIX side, I always unzip the zip files, and bzip files are a rarity, but gzip files can sometimes be left zipped. The key is what program downstream is using them; many can take zipped files, others can't . Those that can differ on how they detect them; some require command-line switches and others just recognize the canonical .gz extension.

Now, some programs can use that paired end information, but those that do differ in how they want it. One set of programs demand that you give the two FASTQ files separately (with BWA this is typically in two separate command lines, one for each side; they are combined later in the process). But another set of program require you to shuffle the two together into an interleaved format, so that the first read from fragment 1 is followed by the other read from fragment 1 and so forth. Very few programs will accept either; the Ray assembler is a notable exception.

So, if you have to shuffle them you now have a second copy of all your FASTQ files. Before or after, perhaps you want to apply some trimming, filtering or error correction tools. Each of which is typically going to generate yet another version of the FASTQ files. If you are trying out various permutations or parameters of these various programs, it is not hard to accumulate a small zoo of FASTQ files. Well, not really a small zoo; they're all likely to be quite large. A few programs can at least output compressed data, but that isn't much use unless the next program will accept it.

Worse, if successive stages of your pipeline require different paired end formats, then you really have fun! For example, I was recently trying the MUSKET error corrector prior to applying the FLASH paired end combiner. MUSKET actually doesn't care about paired ends, but will only generate a single FASTQ file no matter how many inputs you give it. FLASH on the other hand needs the two reads in separate files. So, I interleaved the sequences for MUSKET, processed them and then de-interleaved them for FLASH. The first time through I generated the interleaved file but forgot to use it, meaning I fed FLASH essentially a random pairing of forward and reverse reads!! Until I'm done playing, it seems like each dataset has a growing number of slightly different FASTQs lying around. Once I try assembling them, that data ends up (one way or another) making more copies, as the read and quality data ends up in an assembly file or BAM file!

In an ideal world, perhaps I wouldn't need so many copies -- every program could accept every flavor of paired end read assortments, or better yet the reads would live once in a database and programs could extract the right version (with trimming and edits) on the fly. Of course, loading such a database might take a lot of time, and perhaps if I just settled on a pipeline I'd just clean out each intermediate once the next stage completes.

If you're not scared of being buried up to your neck in FASTQ (which, unlike tribbles, are neither cuddly nor make calming trills), Warp Drive is looking for someone like you. We're getting set to move to a larger starbase on the other side of Kendall Square, which opens the possibility of expanding our staff. We're looking for a computational biologist with experience in NGS data manipulation (ideally with a heavy emphasis on de novo assembly). Details on applying can be found on our website, though there isn't really a good job description there (yet).

11 comments:

Wouldn't it be better to run FLASH first, then MUSKET? I think FLASH will do some "error correction" when merging reads, it looks at quality scores in the overlapping region and adjust if one position is mismatched with low quality score of one base and higher on the other. If you first run error correction, you might chop off ends of reads that might have been rescued by FLASH.

You missed the bit about the different score encoding schemes. Sanger, Solexa, Illumina 1.3+ - the worst bit is that there is no obvious way of differentiating between them other than reading the scores and see which fits.

No, the fastq format should not have comment lines in ASCII. Otherwise, a fastq file will just be like a .xls file that biologists are using everyday: unorganized and messy. Information about projects, experiments, runs and samples should be upstream, like you can readily find on NCBI SRA, EBI ENA or DDBJ. README files are fine too.

Instead the format should just not be human readable (like SAM versus BAM with samtools view -). Computers are better at reading stuff when each entry has the same number of words. But biologists would not like that. The ASN.1 format has been around for a while (see http://www.ncbi.nlm.nih.gov/Sitemap/Summary/asn1.html ), but nobody is using that really.

Obviously, you don't want to use XML or JSON or Lisp to represent DNA sequences because they are not block-based. You want to know where to go in the file for the ith sequence.

@OmicsOmicsBlog

The problem with genomics is that the common denominator is the file -- programs simply don't even try to communicate using IPC (Inter Process Communication), be it with sockets or something else more fancy like OFED.

Some programs even do message passing over the VFS (virtual file system), which is really a bad design because you're going through the block layer in many case and thereby touching rotating slow devices.

For genomics, a computer is essentially:

- a processor (or more than 1 processor), -a memory node (or more than 1 memory node in case of NUMA systems), - storage block devices,- a graphics unit for visualization,- a network character device

The truth is that the storage block devices are the slowest parts.

In physics, most of the tools use the HDF format. Pacific Biosciences cleverly use the HDF format too.

I think that the FastQ (with Sanger qualities) is doing a great job actually. All we need now is the obvious Binary FastQ file format (and a tool to convert FastQ -> Binary Fastq and the other way around).

The closest thing we have to Binary FastQ is the really cool BAM format.

You say "Information about projects, experiments, runs and samples should be upstream, like you can readily find on NCBI SRA, EBI ENA or DDBJ. README files are fine too."

I disagree with this. I would much rather have my metadata with my sequence data. README files can get detached from sequence file. Upstream sources are not always available. Getting a FastQ file fresh off the sequencer -- where is the upstream information?

I point out that the "really cool BAM" format has metadata in it. Some of the information (e.g., @SQ) is structured while other (e.g., @CO) is not.

I guess I really don't get the enthusiasm for binary files. Don't you people write scripts? You can't parse binary (well, not without custom libraries that may not exist for your preferred language and OS).

I have TXT files that I need an easy program to convert to fastq files. It would be great to have a program where you could simply browse and select a TXT file that would then automatically be converted to fastq. All I can find are Perl script options which for a non-programmer is difficult. Any tools or softwares you can recommend?

Without knowing more about the format of the TXT files, it is impossible to say & probably even harder to write a generic program.

If you want to advance in genomics, bite the bullet and dip into programming. I recommend Python as the language to try first; even if it doesn't stick, you'll benefit from the effort. I know everyone doesn't end up a programmer, but everyone should take a stab at it.

About Me

Dr. Robison spent 10 years at Millennium Pharmaceuticals working with various genomics & proteomics technologies & working on multiple teams attempting to apply these throughout the drug discovery process. He spent 2 years at Codon Devices working on a variety of protein & metabolic engineering projects as well as monitoring a high-throughput gene synthesis facility. After a brief bit of consulting, he rejoined the cancer drug discovery field at Infinity Pharmaceuticals in May 2009. In September 2011 he joined Warp Drive Bio, a startup applying genomics to natural product drug discovery. Other recurring characters in this blog are his loyal Shih Tzu Amanda and his teenaged son alias TNG (The Next Generation).
Dr. Robison can be reached via his Gmail account, keith.e.robison@gmail.com
You can also follow him on Twitter as @OmicsOmicsBlog.