Nextgen in BioPerl

This is a page for developer and user discussion of next-generation sequencing support in BioPerl. Please comment freely to help create priorities and use cases for development going forward. Thanks to all for your contributions!

Wish List

Improved support for FASTQ

After some discussion on the mailing list, the consensus so far is to support the current next-gen formats within SeqIO, using the same naming convention as Biopython, i.e.:

"fastq" in Biopython means the original Sanger standard FASTQ files encoding PHRED qualities using an ASCII offset of 33.

"fastq-solexa" in Biopython means the early Solexa/Illumina style FASTQ files which encode Solexa qualities using an ASCII offset of 64.

"fastq-illumina" in Biopython will mean recent Solexa/Illumina style FASTQ files (from pipeline version 1.3+) which encode PHRED qualities using an ASCII offset of 64. This is in the Biopython repository, but hasn't been released yet - so the name "fastq-illumina" isn't set in stone yet.
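To make the difference between the three variants concrete, here is a minimal sketch in plain Python (standard library only, not Biopython or BioPerl code). The table and function names are hypothetical and chosen for illustration; the offsets come from the descriptions above, and the Solexa-to-PHRED conversion is the commonly cited formula Q_phred = 10 * log10(10^(Q_solexa / 10) + 1).

```python
import math

# Offset and score type for each variant, as described above.
# The dictionary name and layout are illustrative, not a library API.
FASTQ_VARIANTS = {
    "fastq": (33, "phred"),           # original Sanger standard
    "fastq-solexa": (64, "solexa"),   # early Solexa/Illumina
    "fastq-illumina": (64, "phred"),  # Illumina pipeline 1.3+
}


def decode_qualities(quality_string, variant):
    """Turn a FASTQ quality string into a list of integer scores."""
    offset, _score_type = FASTQ_VARIANTS[variant]
    return [ord(ch) - offset for ch in quality_string]


def solexa_to_phred(q_solexa):
    """Convert one Solexa score to the equivalent PHRED score (a float)."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)
```

The same quality character therefore means different things depending on the variant: "I" decodes to 40 under "fastq" but to 9 under "fastq-illumina", which is why the format name has to be chosen explicitly rather than guessed.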

Although performance is not optimal for the large numbers of reads that typically need to be processed, a "standard" implementation will be used for the moment. Potential future improvements include light-weight versions that avoid or reduce object creation, use C, etc.
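As a rough illustration of what a light-weight version might look like, the sketch below yields plain tuples instead of building a sequence object per read. It is written in Python for brevity and is not a proposed BioPerl API; the function name iter_fastq is hypothetical, and it assumes the common four-lines-per-record layout (no line-wrapped sequences or qualities).

```python
import io


def iter_fastq(handle):
    """Yield (identifier, sequence, quality_string) tuples from a FASTQ handle.

    Assumes each record occupies exactly four lines. No per-read objects
    are created beyond the three strings themselves.
    """
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = handle.readline().rstrip("\r\n")
        plus = handle.readline()
        qual = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record near: %r" % header)
        if len(seq) != len(qual):
            raise ValueError("sequence/quality length mismatch in: %r" % header)
        yield header[1:].rstrip("\r\n"), seq, qual


# Usage on an in-memory example with two reads.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2 extra\nGG\n+read2\n!!\n")
records = list(iter_fastq(example))
```

Quality strings are left undecoded here, so the cost of converting to numeric scores is only paid by callers that need them.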

Provide, if possible, some level of validation for the quality values of each format, i.e. check bounds during parsing and warn if they are exceeded.
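A bounds check of this kind could be sketched as follows, again in plain Python rather than as BioPerl code. The names QUALITY_BOUNDS, out_of_range_positions and check_qualities are hypothetical. The bounds are an assumption based on the usual printable-ASCII limits for each variant (0 to 93 for "fastq", -5 to 62 for "fastq-solexa", 0 to 62 for "fastq-illumina"); a real implementation might prefer tighter limits.

```python
import warnings

# (ASCII offset, lowest valid score, highest valid score) per variant.
# These bounds are assumptions for illustration, see the note above.
QUALITY_BOUNDS = {
    "fastq": (33, 0, 93),
    "fastq-solexa": (64, -5, 62),
    "fastq-illumina": (64, 0, 62),
}


def out_of_range_positions(quality_string, variant):
    """Return the 0-based positions whose decoded score is out of bounds."""
    offset, low, high = QUALITY_BOUNDS[variant]
    return [
        i for i, ch in enumerate(quality_string)
        if not low <= ord(ch) - offset <= high
    ]


def check_qualities(quality_string, variant):
    """Decode a quality string, warning (not failing) on out-of-range values."""
    offset, low, high = QUALITY_BOUNDS[variant]
    bad = out_of_range_positions(quality_string, variant)
    if bad:
        warnings.warn(
            "%d quality value(s) outside [%d, %d] for '%s' (first at position %d)"
            % (len(bad), low, high, variant, bad[0])
        )
    return [ord(ch) - offset for ch in quality_string]
```

A useful side effect of such a check is catching files parsed under the wrong variant: a Sanger-encoded string such as "!5I" read as "fastq-illumina" produces negative scores at the first two positions and triggers the warning.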