There is a whole chapter in the [http://biopython.org/DIST/docs/tutorial/Tutorial.html Tutorial] ([http://biopython.org/DIST/docs/tutorial/Tutorial.pdf PDF]) on Bio.SeqIO, and although there is some overlap it is well worth reading in addition to this WIKI page. There is also the [http://biopython.org/DIST/docs/api/Bio.SeqIO-module.html API documentation] (which you can read online, or from within Python with the help command).

There is a whole chapter in the [http://biopython.org/DIST/docs/tutorial/Tutorial.html Tutorial] ([http://biopython.org/DIST/docs/tutorial/Tutorial.pdf PDF]) on Bio.SeqIO, and although there is some overlap it is well worth reading in addition to this WIKI page. There is also the [http://biopython.org/DIST/docs/api/Bio.SeqIO-module.html API documentation] (which you can read online, or from within Python with the help command).

Line 17:

Line 19:

== File Formats ==

== File Formats ==

−

This table lists the file formats that Bio.SeqIO can read and write, with the Biopython version where this was first supported. The format name is a simple lowercase string. Where possible we use the same name as [[bp:HOWTO:SeqIO#Formats|BioPerl's SeqIO]] and [http://emboss.sourceforge.net/docs/themes/SequenceFormats.html EMBOSS].

+

This table lists the file formats that Bio.SeqIO can read, write and index, with the Biopython version where this was first supported (or [[git]] to indicate this is supported in our latest in development code). The format name is a simple lowercase string. Where possible we use the same name as [[bp:HOWTO:SeqIO#Formats|BioPerl's SeqIO]] and [http://emboss.sourceforge.net/docs/themes/SequenceFormats.html EMBOSS].

{| border="1" cellpadding="4" cellspacing="0"

{| border="1" cellpadding="4" cellspacing="0"

Line 23:

Line 25:

|-

|-

! Format name

! Format name

−

! Reads

+

! Read

−

! Writes

+

! Write

+

! Index

! Notes

! Notes

+

|-

+

| abi

+

| 1.58

+

| No

+

| N/A

+

| Reads the ABI "Sanger" capillary sequence traces files, including the PHRED quality scores for the base calls. This allows ABI to FASTQ conversion. Note each ABI file contains one and only one sequence (so there is no point in indexing the file).

| This refers to the input [[bp:FASTA_sequence_format|FASTA file format]] introduced for Bill Pearson's FASTA tool, where each record starts with a ">" line. Resulting sequences have a generic alphabet by default.

| This refers to the input [[bp:FASTA_sequence_format|FASTA file format]] introduced for Bill Pearson's FASTA tool, where each record starts with a ">" line. Resulting sequences have a generic alphabet by default.

|-

|-

−

| fastq

+

| fastq-sanger or fastq

| 1.50

| 1.50

| 1.50

| 1.50

−

| [http://en.wikipedia.org/wiki/FASTQ_format FASTQ files] are a bit like FASTA files but also include sequencing qualities. In Biopython, "fastq" refers to Sanger style FASTQ files which encode PHRED qualities using an ASCII offset of 33. See also the incompatible "fastq-solexa" and "fastq-illumina" variants.

+

| 1.52

+

| [http://en.wikipedia.org/wiki/FASTQ_format FASTQ files] are a bit like FASTA files but also include sequencing qualities. In Biopython, "fastq" (or the alias "fastq-sanger") refers to Sanger style FASTQ files which encode PHRED qualities using an ASCII offset of 33. See also the incompatible "fastq-solexa" and "fastq-illumina" variants used in early Solexa/Illumina pipelines, Illumina pipeline 1.8 produces Sanger FASTQ.

|-

|-

| fastq-solexa

| fastq-solexa

| 1.50

| 1.50

| 1.50

| 1.50

−

| [http://en.wikipedia.org/wiki/FASTQ_format FASTQ files] are a bit like FASTA files but also include sequencing qualities. In Biopython, "fastq-solexa" refers to (early) Solexa/Illumina style FASTQ files which encode Solexa qualities using an ASCII offset of 64. See also what we call the "fastq-illumina" format.

+

| 1.52

+

| [http://en.wikipedia.org/wiki/FASTQ_format FASTQ files] are a bit like FASTA files but also include sequencing qualities. In Biopython, "fastq-solexa" refers to the original Solexa/Illumina style FASTQ files which encode Solexa qualities using an ASCII offset of 64. See also what we call the "fastq-illumina" format.

|-

|-

| fastq-illumina

| fastq-illumina

| 1.51

| 1.51

| 1.51

| 1.51

−

| [http://en.wikipedia.org/wiki/FASTQ_format FASTQ files] are a bit like FASTA files but also include sequencing qualities. In Biopython, "fastq-illumina" refers to recent Solexa/Illumina style FASTQ files (from pipeline version 1.3+) which encode PHRED qualities using an ASCII offset of 64. For ''good'' quality reads, PHRED and Solexa scores are approximately equal, so the "fastq-solexa" and "fastq-illumina" variants are almost equivalent.

+

| 1.52

+

| [http://en.wikipedia.org/wiki/FASTQ_format FASTQ files] are a bit like FASTA files but also include sequencing qualities. In Biopython, "fastq-illumina" refers to early Solexa/Illumina style FASTQ files (from pipeline version 1.3 to 1.7) which encode PHRED qualities using an ASCII offset of 64. For ''good'' quality reads, PHRED and Solexa scores are approximately equal, so the "fastq-solexa" and "fastq-illumina" variants are almost equivalent.

|-

|-

−

| genbank

+

| genbank or gb

| 1.43

| 1.43

| 1.48 / 1.51

| 1.48 / 1.51

+

| 1.52

| The [[bp:GenBank_sequence_format|GenBank or GenPept flat file format]]. Uses Bio.GenBank internally for parsing. Biopython 1.48 to 1.50 wrote basic GenBank files with only minimal annotation, while 1.51 onwards will also write the features table (see [http://bugzilla.open-bio.org/show_bug.cgi?id=2294 Bug 2294]).

| The [[bp:GenBank_sequence_format|GenBank or GenPept flat file format]]. Uses Bio.GenBank internally for parsing. Biopython 1.48 to 1.50 wrote basic GenBank files with only minimal annotation, while 1.51 onwards will also write the features table (see [http://bugzilla.open-bio.org/show_bug.cgi?id=2294 Bug 2294]).

|-

|-

Line 70:

Line 87:

| 1.47

| 1.47

| No

| No

+

| 1.52

| This refers to the IntelliGenetics file format, apparently the same as the [[bp:Mase_multiple_alignment_format|MASE alignment format]].

| This refers to the IntelliGenetics file format, apparently the same as the [[bp:Mase_multiple_alignment_format|MASE alignment format]].

| [[bp:Tab_sequence_format|Simple two column tab separated sequence files]], where each line holds a record's identifier and sequence. For example, this is used by Aligent's eArray software when saving microarray probes in a minimal tab delimited text file.

| [[bp:Tab_sequence_format|Simple two column tab separated sequence files]], where each line holds a record's identifier and sequence. For example, this is used by Aligent's eArray software when saving microarray probes in a minimal tab delimited text file.

|-

|-

Line 110:

Line 153:

| 1.50

| 1.50

| 1.50

| 1.50

+

| 1.52

| [[bp:Qual_sequence_format|Qual files]] are a bit like FASTA files but instead of the sequence, record space separated integer sequencing values as PHRED quality scores. A matched pair of FASTA and QUAL files are often used as an alternative to a single FASTQ file.

| [[bp:Qual_sequence_format|Qual files]] are a bit like FASTA files but instead of the sequence, record space separated integer sequencing values as PHRED quality scores. A matched pair of FASTA and QUAL files are often used as an alternative to a single FASTQ file.

+

|-

+

| uniprot-xml

+

| 1.56

+

| No

+

| 1.56

+

| UniProt XML format, successor to the plain text Swiss-Prot format.

|-

|-

|}

|}

Line 153:

Line 203:

</python>

</python>

−

Another common task is to index your records by some identifier. For this we have a function '''Bio.SeqIO.to_dict()''' to turn a [[SeqRecord]] iterator (or list) into a dictionary:

+

Another common task is to index your records by some identifier. For small files we have a function '''Bio.SeqIO.to_dict()''' to turn a [[SeqRecord]] iterator (or list) into a dictionary (in memory):

<python>

<python>

Line 165:

Line 215:

The function '''Bio.SeqIO.to_dict()''' will use the record ID as the dictionary key by default, but you can specify any mapping you like with its optional argument, '''key_function'''.

The function '''Bio.SeqIO.to_dict()''' will use the record ID as the dictionary key by default, but you can specify any mapping you like with its optional argument, '''key_function'''.

−

Finally the function '''Bio.SeqIO.to_alignment()''' can be used to turn a SeqRecord iterator (or list) into an alignment object - provided all the sequences are the same length:

+

For larger files, it isn't possible to hold everything in memory, so '''Bio.SeqIO.to_dict()''' is not suitable. Biopython 1.52 inwards includes the '''Bio.SeqIO.index()''' function for this situation, but you might also consider [[BioSQL]].

<python>

<python>

from Bio import SeqIO

from Bio import SeqIO

−

handle = open("opuntia.aln", "rU")

+

record_dict = SeqIO.index("example.fasta", "fasta")

−

alignment = SeqIO.to_alignment(SeqIO.parse(handle, "clustal"))

+

print record_dict["gi:12345678"] #use any record ID

−

handle.close()

+

−

for column in range(alignment.get_alignment_length()) :

+

−

print "%s column %i" % (alignment.get_column(column),column)

+

</python>

</python>

−

−

In the future it may be possible to do this directly via the Alignment object. See also the [[AlignIO|Bio.AlignIO]] module for working directly with Alignment objects.

Biopython 1.45 introduced another function, '''Bio.SeqIO.read()''', which like '''Bio.SeqIO.parse()''' will expect a handle and format. It is for use when the handle contains one and only one record, which is returned as a single [[SeqRecord]] object. If there are no records, or more than one, then an exception is raised:

Biopython 1.45 introduced another function, '''Bio.SeqIO.read()''', which like '''Bio.SeqIO.parse()''' will expect a handle and format. It is for use when the handle contains one and only one record, which is returned as a single [[SeqRecord]] object. If there are no records, or more than one, then an exception is raised:

Line 235:

Line 280:

sequences = SeqIO.parse(input_handle, "genbank")

sequences = SeqIO.parse(input_handle, "genbank")

−

SeqIO.write(sequences, output_handle, "fasta")

+

count = SeqIO.write(sequences, output_handle, "fasta")

output_handle.close()

output_handle.close()

input_handle.close()

input_handle.close()

+

+

print "Converted %i records" % count

</python>

</python>

−

Or more concisely, without explicitly closing the handles, just:

+

Or more concisely using the '''Bio.SeqIO.convert()''' function (in Biopython 1.52 or later), just:

The format method will take any output format supported by '''Bio.SeqIO''' where the file format can be used for a single record (e.g. "fasta", "tab" or "genbank").

+

The format method will take any output format supported by '''Bio.SeqIO''' where the file format can be used for a single record (e.g. "fasta", "tab" or "genbank"). Note that we don't recommend you use this for file output - using '''Bio.SeqIO.write()''' is faster and more general.

== Help! ==

== Help! ==

Line 424:

Line 471:

If you are having problems with '''Bio.SeqIO''', please join the discussion mailing list (see [[mailing lists]]).

If you are having problems with '''Bio.SeqIO''', please join the discussion mailing list (see [[mailing lists]]).

−

If you think you've found a bug, please report it on [http://bugzilla.open-bio.org/ bugzilla].

+

If you think you've found a bug, please report it on our [http://redmine.open-bio.org/projects/biopython bug tracker].

+

+

[[Category:Wiki Documentation]]

Latest revision as of 11:28, 23 March 2014

This page describes Bio.SeqIO, the standard Sequence Input/Output interface for BioPython 1.43 and later. For implementation details, see the SeqIO development page.

There is a whole chapter in the Tutorial (PDF) on Bio.SeqIO, and although there is some overlap it is well worth reading in addition to this WIKI page. There is also the API documentation (which you can read online, or from within Python with the help command).

Aims

Bio.SeqIO provides a simple uniform interface to input and output assorted sequence file formats (including multiple sequence alignments), but will only deal with sequences as SeqRecord objects. There is a sister interface Bio.AlignIO for working directly with sequence alignment files as Alignment objects.

Note that the inclusion of Bio.SeqIO (and Bio.AlignIO) in Biopython does lead to some duplication or choice in how to deal with some file formats. For example, Bio.Nexus will also read sequences from Nexus files - but Bio.Nexus can also do much more, for example reading any phylogenetic trees in a Nexus file.

My vision is that for manipulating sequence data you should try Bio.SeqIO as your first choice. Unless you have some very specific requirements, I hope this should suffice.

File Formats

This table lists the file formats that Bio.SeqIO can read, write and index, with the Biopython version where this was first supported (or git to indicate this is supported in our latest in development code). The format name is a simple lowercase string. Where possible we use the same name as BioPerl's SeqIO and EMBOSS.

Table 1: Bio.SeqIO supported file formats

Format name

Read

Write

Index

Notes

abi

1.58

No

N/A

Reads the ABI "Sanger" capillary sequence traces files, including the PHRED quality scores for the base calls. This allows ABI to FASTQ conversion. Note each ABI file contains one and only one sequence (so there is no point in indexing the file).

This refers to the input FASTA file format introduced for Bill Pearson's FASTA tool, where each record starts with a ">" line. Resulting sequences have a generic alphabet by default.

fastq-sanger or fastq

1.50

1.50

1.52

FASTQ files are a bit like FASTA files but also include sequencing qualities. In Biopython, "fastq" (or the alias "fastq-sanger") refers to Sanger style FASTQ files which encode PHRED qualities using an ASCII offset of 33. See also the incompatible "fastq-solexa" and "fastq-illumina" variants used in early Solexa/Illumina pipelines, Illumina pipeline 1.8 produces Sanger FASTQ.

fastq-solexa

1.50

1.50

1.52

FASTQ files are a bit like FASTA files but also include sequencing qualities. In Biopython, "fastq-solexa" refers to the original Solexa/Illumina style FASTQ files which encode Solexa qualities using an ASCII offset of 64. See also what we call the "fastq-illumina" format.

fastq-illumina

1.51

1.51

1.52

FASTQ files are a bit like FASTA files but also include sequencing qualities. In Biopython, "fastq-illumina" refers to early Solexa/Illumina style FASTQ files (from pipeline version 1.3 to 1.7) which encode PHRED qualities using an ASCII offset of 64. For good quality reads, PHRED and Solexa scores are approximately equal, so the "fastq-solexa" and "fastq-illumina" variants are almost equivalent.

genbank or gb

1.43

1.48 / 1.51

1.52

The GenBank or GenPept flat file format. Uses Bio.GenBank internally for parsing. Biopython 1.48 to 1.50 wrote basic GenBank files with only minimal annotation, while 1.51 onwards will also write the features table (see Bug 2294).

Qual files are a bit like FASTA files but instead of the sequence, record space separated integer sequencing values as PHRED quality scores. A matched pair of FASTA and QUAL files are often used as an alternative to a single FASTQ file.

uniprot-xml

1.56

No

1.56

UniProt XML format, successor to the plain text Swiss-Prot format.

With Bio.SeqIO you can treat sequence alignment file formats just like any other sequence file, but the new Bio.AlignIO module is designed to work with such alignment files directly. You can also convert a set of SeqRecord objects from any file format into an alignment - provided they are all the same length. Note that when using Bio.SeqIO to write sequences to an alignment file format, all the (gapped) sequences should be the same length.

Sequence Input

The main function is Bio.SeqIO.parse() which takes a file handle and format name, and returns a SeqRecord iterator. This lets you do things like:

In the above example, we opened the file using the built-in python function open. The argument 'rU' means open for reading using universal readline mode - this means you don't have to worry if the file uses Unix, Mac or DOS/Windows style newline characters.

Note that you must specify the file format explicitly, unlike BioPerl's SeqIO which can try to guess using the file name extension and/or the file contents. See Explicit is better than implicit (The Zen of Python).

If you had a different type of file, for example a Clustalw alignment file such as 'opuntia.aln' which contains seven sequences, the only difference is you specify "clustal" instead of "fasta":

Iterators are great for when you only need the records one by one, in the order found in the file. For some tasks you may need to have random access to the records in any order. In this situation, use the built in python list function to turn the iterator into a list:

The function Bio.SeqIO.to_dict() will use the record ID as the dictionary key by default, but you can specify any mapping you like with its optional argument, key_function.

For larger files, it isn't possible to hold everything in memory, so Bio.SeqIO.to_dict() is not suitable. Biopython 1.52 inwards includes the Bio.SeqIO.index() function for this situation, but you might also consider BioSQL.

Biopython 1.45 introduced another function, Bio.SeqIO.read(), which like Bio.SeqIO.parse() will expect a handle and format. It is for use when the handle contains one and only one record, which is returned as a single SeqRecord object. If there are no records, or more than one, then an exception is raised:

There are more examples in the following section on converting between file formats.

Note that if you are writing to an alignment file format, all your sequences must be the same length.

If you supply the sequences as a SeqRecord iterator, then for sequential file formats like Fasta or GenBank, the records can be written one by one. Because only one record is created at a time, very little memory is required. See the example below filtering a set of records.

On the other hand, for interlaced or non-sequential file formats like Clustal, the Bio.SeqIO.write() function will be forced to automatically convert an iterator into a list. This will destroy any potential memory saving from using an generator/iterator approach.

File Format Conversion

Suppose you have a GenBank file which you want to turn into a Fasta file. For example, lets consider the file 'cor6_6.gb' which is included in the Biopython unit tests under the GenBank directory.

You could read the file like this, using the Bio.SeqIO.parse() function:

I'm not convinced this is actually any easier to understand, but it is shorter.

However, if you are using Python 2.4 or later, and you are dealing with very large files with thousands of records, you could benefit from using a generator expression instead. This avoids creating the entire list of desired records in memory:

Remember that for sequential file formats like Fasta or GenBank, Bio.SeqIO.write() will accept a SeqRecord iterator. The advantage of the code above is that only one record will be in memory at any one time.

However, as explained in the output section, for non-sequential file formats like Clustal write is forced to automatically turn the iterator into a list, so this advantage is lost.

If this is all confusing, don't panic and just ignore the fancy stuff. For moderately sized datasets having too many records in memory at once (e.g. in lists) is probably not going to be a problem.

Using the SEGUID checksum

In this example, we'll use Bio.SeqIO with the Bio.SeqUtils.CheckSum module (in Biopython 1.44 or later). First of all, we'll just print out the checksum for each sequence in the GenBank file ls_orchid.gbk:

from Bio import SeqIO
from Bio.SeqUtils.CheckSumimport seguid
for record in SeqIO.parse(open("ls_orchid.gbk"), "genbank") :
print record.id, seguid(record.seq)

Now lets use the checksum function and Bio.SeqIO.to_dict() to build a SeqRecord dictionary using the SEGUID as the keys. The trick here is to use the python lambda syntax to create a temporary function to get the SEGUID for each SeqRecord - we can't use the seguid function directly as it only works on Seq objects or strings.

Random subsequences

This script will read a Genbank file with a whole mitochondrial genome (e.g. the tobacco mitochondrion, Nicotiana tabacum mitochondrionNC_006581), create 500 records containing random fragments of this genome, and save them as a fasta file. These subsequences are created using a random starting points and a fixed length of 200.

Writing to a string

Sometimes you won't want to write your SeqRecord object(s) to a file, but to a string. For example, you might be preparing output for display as part of a webpage. If you want to write multiple records to a single string, use StringIO to create a string-based handle. The Tutorial (PDF) has an example of this in the SeqIO chapter.

For the special case where you want a single record as a string in a given file format, Biopython 1.48 added a new format method:

The format method will take any output format supported by Bio.SeqIO where the file format can be used for a single record (e.g. "fasta", "tab" or "genbank"). Note that we don't recommend you use this for file output - using Bio.SeqIO.write() is faster and more general.

Help!

If you are having problems with Bio.SeqIO, please join the discussion mailing list (see mailing lists).