This page is aimed at any developers or coders interested in understanding or extending the new Sequence Input/Output interface for BioPython, [[SeqIO]].

−

Some code has now been checked into [http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/SeqIO/?cvsroot=biopython#dirlist CVS], and other bits are available on [http://bugzilla.open-bio.org/show_bug.cgi?id=2059 Bug 2059]. Details are currently being discussed on the [http://biopython.org/wiki/Mailing_lists Development mailing list].

+

The code has now been checked into [http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/SeqIO/?cvsroot=biopython#dirlist CVS]. Related [http://bugzilla.open-bio.org/show_bug.cgi?id=2059 Bug 2059] has been resolved.

−

If all goes well, the code will be available in the next release, probably BioPython 1.43.

+

The code is already available in BioPython 1.43.

−

== Adding new file formats ==

+

== Reading new file formats ==

'''Note:''' The details are still subject to change

Line 21:

When storing any annotations in the record's annotations dictionary, follow the de facto standard laid down by the GenBank parser... I should try and document this more.

−

To add support for writing a new file format you should write a subclass of one of the writer objects in Bio.SeqIO.Interfaces.

+

If the supplied file seems to be invalid, raise a ValueError exception.

−

Then, the new format must be added to the relevant dictionary mappings in Bio/SeqIO/__init__.py so that the '''Bio.SeqIO.parse''' and '''Bio.SeqIO.write''' functions are aware of it.

+

Finally, the new format must be added to the relevant dictionary mapping in Bio/SeqIO/__init__.py so that the '''Bio.SeqIO.parse()''' and '''Bio.SeqIO.read()''' functions are aware of it.

−

== Possible additional formats ==

+

== Writing new file formats ==

−

There are existing parsers in BioPython for the following file formats, which could be integrated into Bio.SeqIO if appropriate.

+

'''Note:''' The details are still subject to change

−

=== NBRF / PIR format ===

+

To add support for writing a new file format you should write a subclass of one of the writer objects in Bio.SeqIO.Interfaces.

−

Bio.NBRF has a Martel parser for this file format, which is similar to the FASTA format. It would need additional work to return SeqRecords. It might be easier to extend or reuse the Bio.SeqIO FASTA code instead.

+

Then, the new format must be added to the relevant dictionary mappings in Bio/SeqIO/__init__.py so that the '''Bio.SeqIO.write()''' function is aware of your code.

−

http://www.bioperl.org/wiki/PIR_sequence_format

+

If the supplied records cannot be written to this file format, raise a ValueError exception. Where appropriate, please use the following wording:

−

http://www.psc.edu/general/software/packages/seq-intro/nbrffile.html

+

<python>

+

raise ValueError("Must have at least one sequence")

+

raise ValueError("Sequences must all be the same length")

+

raise ValueError("Duplicate record identifier: %s" % ...)

+

...

+

</python>

−

=== KEGG format ===

+

ToDo - Define standard exceptions in Bio.SeqIO itself?

−

Can Bio.KEGG parse these files?

+

== Possible additional formats ==

−

http://www.bioperl.org/wiki/KEGG_sequence_format

+

There are existing parsers in BioPython for the following file formats, which could be integrated into Bio.SeqIO or Bio.AlignIO if appropriate.

−

=== PHD sequencing files from PHRED ===

+

=== KEGG format ===

−

+

−

Bio.Sequencing.PHD has a Martel parser for this file format, also used by the tools PHRAP and CONSED.

+

−

+

−

http://www.bioperl.org/wiki/PHD_sequence_format

+

−

+

−

=== MASE alignment format ===

+

−

+

−

Bio.IntelliGenetics seems to use Martel to parse MASE format files into its own record object. It could be extended to return SeqRecord objects. See:

+

−

http://pbil.univ-lyon1.fr/help/formats.html

+

Can Bio.KEGG parse files in [[bp:KEGG sequence format|KEGG format]]?

=== MEME format ===

Line 63:

Line 60:

=== BLAST results ===

−

Pairwise alignments from the BLAST suite could be turned into two SeqRecord objects with gapped sequences. Is this useful?

+

Pairwise alignments from the BLAST suite could be turned into a pairwise Alignment object with Bio.AlignIO. Is this useful? Sample code is available on [http://bugzilla.open-bio.org/show_bug.cgi?id=2560 Bug 2560].

=== COMPASS pairwise alignment format ===

−

Bio.Compass can parse the pairwise alignments from COMPASS. The output is similar to BLAST in many ways. Again, is getting the results as SeqRecord objects useful?

+

Bio.Compass can parse the pairwise alignments from COMPASS. The output is similar to BLAST in many ways. Again, is getting the results as SeqRecord or pairwise alignment objects useful?

Revision as of 10:45, 24 March 2010

This page is aimed at any developers or coders interested in understanding or extending the new Sequence Input/Output interface for BioPython, SeqIO.

The code has now been checked into CVS. Related Bug 2059 has been resolved.

An ordinary function which returns an iterator. For example, you could build a list of SeqRecords and then turn it into an iterator using the iter() function.

You may accept additional optional arguments (an alphabet for example). However there must be one and only one required argument (the input file handle).

What you use as the SeqRecord's id, name and description will depend on the file format. Ideally you would use the accession number for the id. This id should also be unique for each record (unless the records in the file are themselves ambiguous).

When storing any annotations in the record's annotations dictionary, follow the de facto standard laid down by the GenBank parser... I should try and document this more.

If the supplied file seems to be invalid, raise a ValueError exception.

Finally, the new format must be added to the relevant dictionary mapping in Bio/SeqIO/__init__.py so that the Bio.SeqIO.parse() and Bio.SeqIO.read() functions are aware of it.
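As a rough illustration of the points above, here is a minimal sketch of a parser function for a made-up tab-separated format. The format name "tab-sketch", the SimpleRecord stand-in (used in place of a real SeqRecord so the sketch runs without BioPython installed) and the format_to_iterator dictionary are all assumptions made for this example; the real mapping lives in Bio/SeqIO/__init__.py and may well look different.

```python
from collections import namedtuple
from io import StringIO

# Hypothetical stand-in for a SeqRecord, so the sketch has no dependencies.
SimpleRecord = namedtuple("SimpleRecord", ["id", "name", "description", "seq"])


def tab_sketch_iterator(handle, alphabet=None):
    """Yield one record per line of a made-up 'id<TAB>sequence' format.

    The handle is the one and only required argument. The alphabet is an
    optional extra, unused here and shown only to illustrate the signature.
    """
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # tolerate blank lines
        parts = line.split("\t")
        if len(parts) != 2 or not parts[0] or not parts[1]:
            # An invalid file is reported with a ValueError.
            raise ValueError("Line %i is not 'id<TAB>sequence'" % line_number)
        identifier, sequence = parts
        yield SimpleRecord(id=identifier, name=identifier,
                           description="", seq=sequence)


# Hypothetical stand-in for the format name to parser mapping.
format_to_iterator = {"tab-sketch": tab_sketch_iterator}

handle = StringIO("seq1\tACGT\nseq2\tGGCC\n")
records = list(format_to_iterator["tab-sketch"](handle))
print([record.id for record in records])
```

Writing the parser as a generator function is one easy way to return an iterator; building a list and calling iter() on it, as described above, would work equally well for small files.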

Writing new file formats

Note: The details are still subject to change

To add support for writing a new file format you should write a subclass of one of the writer objects in Bio.SeqIO.Interfaces.

Then, the new format must be added to the relevant dictionary mappings in Bio/SeqIO/__init__.py so that the Bio.SeqIO.write() function is aware of your code.
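The general shape of such a writer might look something like the sketch below. Note that WriterBase here is a hypothetical stand-in written for this example, not the real class from Bio.SeqIO.Interfaces (check the source for the actual method names), and the "tab-sketch" format and format_to_writer dictionary are likewise made up. Records are plain (identifier, sequence) pairs so that the sketch runs without BioPython.

```python
from io import StringIO


class WriterBase:
    """Hypothetical base class; NOT the real Bio.SeqIO.Interfaces API."""

    def __init__(self, handle):
        self.handle = handle

    def write_record(self, record):
        raise NotImplementedError("Subclasses should implement this")

    def write_file(self, records):
        """Write every record and return how many were written."""
        count = 0
        for record in records:
            self.write_record(record)
            count += 1
        return count


class TabSketchWriter(WriterBase):
    """Writer for a made-up 'id<TAB>sequence' format."""

    def write_record(self, record):
        identifier, sequence = record
        if "\t" in identifier:
            # Records that cannot be represented are reported with ValueError.
            raise ValueError("Identifier cannot contain a tab: %r" % identifier)
        self.handle.write("%s\t%s\n" % (identifier, sequence))


# Hypothetical stand-in for the format name to writer mapping.
format_to_writer = {"tab-sketch": TabSketchWriter}

out = StringIO()
writer = format_to_writer["tab-sketch"](out)
written = writer.write_file([("seq1", "ACGT"), ("seq2", "GGCC")])
print(written)
```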

If the supplied records cannot be written to this file format, raise a ValueError exception. Where appropriate, please use the following wording:

raise ValueError("Must have at least one sequence")
raise ValueError("Sequences must all be the same length")
raise ValueError("Duplicate record identifier: %s" % ...)
...
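To show where these messages might be raised, here is a small sketch of a checking function that a writer for an alignment-style format could call before writing anything. The function name check_records and the use of (identifier, sequence) pairs in place of SeqRecord objects are assumptions made for this example only.

```python
def check_records(records):
    """Validate a list of (identifier, sequence) pairs before writing.

    Hypothetical helper; raises ValueError using the suggested wording.
    """
    if not records:
        raise ValueError("Must have at least one sequence")
    expected_length = len(records[0][1])
    seen_identifiers = set()
    for identifier, sequence in records:
        if len(sequence) != expected_length:
            raise ValueError("Sequences must all be the same length")
        if identifier in seen_identifiers:
            raise ValueError("Duplicate record identifier: %s" % identifier)
        seen_identifiers.add(identifier)


def error_message(records):
    """Return the ValueError text for these records, or None if they pass."""
    try:
        check_records(records)
    except ValueError as error:
        return str(error)
    return None


print(error_message([("a", "ACGT"), ("a", "TTTT")]))
```

Not every format needs all three checks; the equal-length test, for instance, only makes sense for alignment formats.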

ToDo - Define standard exceptions in Bio.SeqIO itself?

Possible additional formats

There are existing parsers in BioPython for the following file formats, which could be integrated into Bio.SeqIO or Bio.AlignIO if appropriate.