5.2. Introduction to Sequence Formats

5.2.1. What is a Sequence Format?

A sequence format defines the permitted layout and content of text in a file. This includes text tokens that define fields used in a databank. These fields include the sequence itself, the sequence identifier name and accession number, amongst others. Non-printable control characters are not generally used, allowing most formats to be viewed on screen or printed out.

The FASTA format is very widely used (and abused). It consists of a header line starting with a > character, followed by a code identifying the sequence and, very often, some text describing the sequence. The header line is followed by one or more lines containing the sequence itself. FASTA files may contain one or more sequences.
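As a rough illustration of the layout just described, the following Python sketch splits FASTA text into records. It is not how EMBOSS reads FASTA; the function names and the sample sequences are invented for the example, and it assumes the code is the first whitespace-delimited token on the header line.

```python
def _make_record(header, chunks):
    # The first whitespace-delimited token is taken as the identifying code;
    # anything after it is treated as free-text description.
    parts = header.split(None, 1)
    identifier = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return identifier, description, "".join(chunks)


def parse_fasta(text):
    """Return a list of (identifier, description, sequence) tuples."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append(_make_record(header, chunks))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append(_make_record(header, chunks))
    return records


# Made-up two-sequence file used only to exercise the function.
example = """>seq1 first example sequence
ACGTACGT
ACGT
>seq2
TTGACA
"""

records = parse_fasta(example)
for identifier, description, sequence in records:
    print(identifier, len(sequence), description)
```

Sequence lines are simply joined, so a sequence wrapped over several lines is treated the same as one written on a single line.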

Sadly, sequences are occasionally stored in non-standard formats. These include proprietary word processor formats (e.g. MS Word and MS WordPad) and text formatting languages (e.g. PostScript, PDF, RTF, TeX and HTML). EMBOSS will not read a sequence in any of these formats.

If you have a sequence in a non-standard format you should:

Save the sequence to a file as plain ASCII text, without any formatting whatsoever. The file should contain the sequence only. EMBOSS will recognise this "plain" format. The program you are using to view the file should have an option to "Save as..." plain text.

If there is not an option to save your sequence in plain text format directly, there may well be a utility program to convert the file to plain text format. The EMBOSS user community will be able to help you with this (see Section 3.5, “How to Get Help”).

In future, use a text editor that can write files in plain text format. These include pico, nedit, emacs and MS WordPad. When using a text editor to create a sequence file, the best (simplest) format to use is FASTA as described above. Be sure to save your sequence as plain text.

If you intend to manipulate or edit the sequences substantially, investigate using a full-blown sequence editor such as mse. Such editors should have an option to save the sequence to a file in one or more of the standard formats.
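If the text has already been saved as plain ASCII but still carries position numbers or spacing, a small script can tidy it into FASTA. The sketch below is one possible approach, not an EMBOSS utility: the function name `to_fasta`, the line width of 60 and the choice to keep only letters, `*` and `-` are all assumptions made for the example.

```python
import re


def to_fasta(identifier, raw_text, width=60):
    """Return raw_text as one FASTA record.

    Everything except letters, '*' and '-' is discarded (digits, spaces,
    line breaks), then the residues are re-wrapped at `width` per line.
    """
    residues = re.sub(r"[^A-Za-z*-]", "", raw_text)
    lines = [residues[i:i + width] for i in range(0, len(residues), width)]
    return ">" + identifier + "\n" + "\n".join(lines) + "\n"


# Made-up input with position numbers and spaces, as might be pasted
# from a document.
pasted = "1 acgt acgt\n9 ttga"
print(to_fasta("myseq", pasted, width=8))
```

Because every letter is kept, any stray words in the input would end up in the sequence, so the text should contain the sequence only.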

5.2.2. Supported Sequence Formats

Some sequence formats can hold multiple sequences in one file. Typically there will be multiple entries (one per sequence) that are catenated in the file. Other formats, such as Staden, can only hold one sequence per file. Catenating several such sequences in one file would produce a mess in which the sequences would be difficult to tell apart from the annotation. Most systems, including EMBOSS, will not parse such files, so you should never use a single-sequence format to hold multiple sequences.

Sequences are also held in alignment files. These contain the results of aligning (lining up similar or equivalent characters in) two or more sequences. EMBOSS supports most common sequence alignment formats (Section A.3, “Supported Alignment Formats”).

All of the common sequence formats are supported in EMBOSS for both application input (reading) and output (writing). These are summarised below. Some support single sequences only, some multiple sequences. The names of the sequence formats are taken from common EMBOSS database configurations. Some of these are obviously synonyms, e.g. "embl" and "em". In practice, the names available will depend on what is defined in your EMBOSS configuration files (see Section 2.8, “Maintenance”). For descriptions and examples of the supported formats see Section A.1, “Supported Sequence Formats”.

The supported sequence formats are summarised in the tables below. The columns are as follows:

Input Format, Output Format: the format name.

Sngl (Table 5.2 only): indicates whether each sequence is written to a new file. This behaviour is the default and can be set by the -ossingle command line qualifier.

Save (Table 5.2 only): indicates that sequence data is stored internally and written when the output is closed. This is needed for 'interleaved' formats such as Phylip and MSF.

Try (Table 5.1 only): indicates whether the format can be detected automatically on input.

Nuc: "Yes" indicates that nucleotide sequence data may be represented.

Pro: "Yes" indicates that protein sequence data may be represented.

Feat: whether the format includes feature annotation data. EMBOSS can also read feature data from a separate feature file.

Gap: whether the format supports sequence data with gap characters, for example the results of an alignment.

Mset: "Yes" indicates that more than one set of sequences can be stored in a single file. This is used by, for example, phylogenetic analysis applications to store many versions of a multiple alignment for statistical analysis.

Description: a short description of the format.

Table 5.1. Input sequence formats

Input Format    Try   Nuc   Pro   Feat  Gap   Mset  Description
abi             Yes   Yes   Yes   No    Yes   No    ABI trace file
acedb           Yes   Yes   Yes   No    Yes   No    ACEDB sequence format
clustal         Yes   Yes   Yes   No    Yes   No    Clustalw output format
codata          Yes   Yes   Yes   Yes   Yes   No    Codata entry format
dbid            No    Yes   Yes   No    Yes   No    Fasta format variant with database name before ID
embl            Yes   Yes   No    Yes   Yes   No    EMBL entry format
experiment      Yes   Yes   Yes   No    Yes   No    Staden experiment file
fasta           Yes   Yes   Yes   No    Yes   No    FASTA format including NCBI-style IDs
fastq           Yes   Yes   No    No    No    No    FASTQ short read format ignoring quality scores
fastq-illumina  No    Yes   No    No    No    No    FASTQ Illumina 1.3 short read format
fastq-sanger    No    Yes   No    No    No    No    FASTQ short read format with phred quality
fastq-solexa    No    Yes   No    No    No    No    FASTQ Solexa/Illumina 1.0 short read format
fitch           Yes   Yes   Yes   No    Yes   No    Fitch program format
gcg             Yes   Yes   Yes   No    Yes   No    GCG sequence format
genbank         Yes   Yes   No    Yes   Yes   No    Genbank entry format
genpept         No    No    Yes   Yes   Yes   No    Refseq protein entry format (alias)
gff2            Yes   Yes   Yes   Yes   Yes   No    GFF feature file with sequence in the header
gff3            Yes   Yes   Yes   Yes   Yes   No    GFF3 feature file with sequence
gifasta         No    Yes   Yes   No    Yes   No    FASTA format including NCBI-style GIs (alias)
hennig86        Yes   Yes   Yes   No    Yes   No    Hennig86 output format
ig              No    Yes   Yes   No    Yes   No    Intelligenetics sequence format
igstrict        Yes   Yes   Yes   No    Yes   No    Intelligenetics sequence format strict parser
jackknifer      Yes   Yes   Yes   No    Yes   No    Jackknifer interleaved and non-interleaved formats
mase            No    Yes   Yes   No    Yes   No    Mase program format
mega            Yes   Yes   Yes   No    Yes   No    Mega interleaved and non-interleaved formats
msf             Yes   Yes   Yes   No    Yes   No    GCG MSF (multiple sequence file) file format
nbrf            Yes   Yes   Yes   Yes   Yes   No    NBRF/PIR entry format
nexus           Yes   Yes   Yes   No    Yes   No    Nexus/paup interleaved format
pdb             Yes   No    Yes   No    No    No    PDB protein databank format ATOM lines
pdbnuc          No    Yes   No    No    No    No    PDB protein databank format nucleotide ATOM lines
pdbnucseq       No    Yes   No    No    No    No    PDB protein databank format nucleotide SEQRES lines
pdbseq          Yes   No    Yes   No    No    No    PDB protein databank format SEQRES lines
pearson         Yes   Yes   Yes   No    Yes   No    Plain old fasta format with IDs not parsed further
phylip          Yes   Yes   Yes   No    Yes   Yes   Phylip interleaved and non-interleaved formats
phylipnon       No    Yes   Yes   No    Yes   Yes   Phylip non-interleaved format
raw             Yes   Yes   Yes   No    No    No    Raw sequence with no non-sequence characters
refseqp         No    No    Yes   Yes   Yes   No    Refseq protein entry format
selex           No    Yes   Yes   No    Yes   No    Selex format
staden          No    Yes   Yes   No    Yes   No    Old staden package sequence format
stockholm       Yes   Yes   Yes   No    Yes   No    Stockholm (pfam) format
strider         Yes   Yes   Yes   No    Yes   No    DNA strider output format
swiss           Yes   No    Yes   Yes   Yes   No    Swissprot entry format
text            No    Yes   Yes   No    Yes   No    Plain text
treecon         Yes   Yes   Yes   No    Yes   No    Treecon output format
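The flags in Table 5.1 lend themselves to simple lookups, for example when choosing an input format that can carry both protein data and features. The Python sketch below encodes a handful of rows copied from the table (not the full list) and filters on them. The class, field and function names are invented for the illustration and are not part of EMBOSS.

```python
from typing import NamedTuple


class InputFormat(NamedTuple):
    name: str
    try_: bool   # "Try" column (trailing underscore avoids the keyword)
    nuc: bool
    pro: bool
    feat: bool
    gap: bool
    mset: bool


# A few rows transcribed from Table 5.1; extend as needed.
INPUT_FORMATS = [
    InputFormat("embl",   True,  True,  False, True,  True,  False),
    InputFormat("fasta",  True,  True,  True,  False, True,  False),
    InputFormat("fastq",  True,  True,  False, False, False, False),
    InputFormat("pdb",    True,  False, True,  False, False, False),
    InputFormat("phylip", True,  True,  True,  False, True,  True),
    InputFormat("swiss",  True,  False, True,  True,  True,  False),
    InputFormat("staden", False, True,  True,  False, True,  False),
]


def formats_with(**required):
    """Names of formats whose flags equal every keyword argument given."""
    return [
        fmt.name
        for fmt in INPUT_FORMATS
        if all(getattr(fmt, key) == value for key, value in required.items())
    ]


protein_with_features = formats_with(pro=True, feat=True)
not_autodetected = formats_with(try_=False)
print(protein_with_features)
print(not_autodetected)
```

A format with Try set to "No", such as staden in this subset, has to be named explicitly on input because it cannot be detected automatically.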

Table 5.2. Output sequence formats

Output Format   Sngl  Save  Nuc   Pro   Feat  Gap   Mset  Description
acedb           No    No    Yes   Yes   No    Yes   No    ACEDB sequence format
asn1            No    No    Yes   Yes   No    Yes   No    NCBI ASN.1 format
clustal         No    Yes   Yes   Yes   No    Yes   No    Clustalw multiple alignment format
codata          No    No    Yes   Yes   No    Yes   No    Codata entry format
das             No    No    Yes   Yes   No    Yes   No    DASSEQUENCE DAS any sequence
dasdna          No    No    Yes   No    No    Yes   No    DASDNA DAS nucleotide-only sequence
debug           No    No    Yes   Yes   No    Yes   No    Debugging trace of full internal data content
embl            No    No    Yes   No    Yes   Yes   No    EMBL entry format
experiment      No    No    Yes   Yes   No    Yes   No    Staden experiment file
fasta           No    No    Yes   Yes   No    Yes   No    FASTA format
fastq-illumina  No    No    Yes   No    No    No    No    FASTQ Illumina 1.3 short read format
fastq-sanger    No    No    Yes   No    No    No    No    FASTQ short read format with phred quality
fastq-solexa    No    No    Yes   No    No    No    No    FASTQ Solexa/Illumina 1.0 short read format
fitch           No    No    Yes   Yes   No    Yes   No    Fitch program format
gcg             No    No    Yes   Yes   No    Yes   No    GCG sequence format
genbank         No    No    Yes   No    No    Yes   No    Genbank entry format
gff2            No    No    Yes   Yes   Yes   Yes   No    GFF2 feature file with sequence in the header
gff3            No    No    Yes   Yes   Yes   Yes   No    GFF3 feature file with sequence in FASTA format after
gifasta         No    No    Yes   Yes   No    Yes   No    NCBI fasta format with NCBI-style IDs using GI number
hennig86        No    Yes   Yes   Yes   No    Yes   No    Hennig86 output format
ig              No    No    Yes   Yes   No    Yes   No    Intelligenetics sequence format
jackknifer      No    Yes   Yes   Yes   No    Yes   No    Jackknifer output interleaved format
jackknifernon   No    Yes   Yes   Yes   No    Yes   No    Jackknifer output non-interleaved format
mase            No    No    Yes   Yes   No    Yes   No    Mase program format
mega            No    Yes   Yes   Yes   No    Yes   No    Mega interleaved output format
meganon         No    Yes   Yes   Yes   No    Yes   No    Mega non-interleaved output format
msf             No    Yes   Yes   Yes   No    Yes   No    GCG MSF (multiple sequence file) file format
nbrf            No    No    Yes   Yes   Yes   Yes   No    NBRF/PIR entry format
ncbi            No    No    Yes   Yes   No    Yes   No    NCBI fasta format with NCBI-style IDs
nexus           No    Yes   Yes   Yes   No    Yes   No    Nexus/paup interleaved format
nexusnon        No    Yes   Yes   Yes   No    Yes   No    Nexus/paup non-interleaved format
phylip          No    Yes   Yes   Yes   No    Yes   Yes   Phylip interleaved format
phylipnon       No    Yes   Yes   Yes   No    Yes   No    Phylip non-interleaved format
selex           No    Yes   Yes   Yes   No    Yes   No    Selex format
staden          No    No    Yes   Yes   No    Yes   No    Old staden package sequence format
strider         No    No    Yes   Yes   No    Yes   No    DNA strider output format
swiss           No    No    No    Yes   Yes   Yes   No    Swissprot entry format
text            No    No    Yes   Yes   No    Yes   No    Plain text
treecon         No    Yes   Yes   Yes   No    Yes   No    Treecon output format

5.2.3. Contents of a Sequence Entry

An entry in a sequence databank will typically include a code and other information to identify the sequence, some bibliographic information, sequence annotation including a description of any features and, of course, the sequence itself.

An excerpt of the EMBL entry for a beta-glucosidase mRNA sequence is shown below:

5.2.3.1. Identification

Ids and Accessions

An entry in a database must have some way of being uniquely identified. Most sequence databases have two such identifiers for each sequence - an ID name and an accession number.

Accession numbers are unique alphanumeric identifiers that are guaranteed to remain with that sequence through the life of the database. If two sequences are merged, then the new sequence will get a new accession number and the accession numbers of the merged sequences will be retained as 'secondary' accession numbers. EMBL, GenBank and Swissprot share an accession numbering scheme - an accession number uniquely identifies a sequence within these three databases. In contrast, ID names are not guaranteed to remain the same between different versions of a database, although in practice they usually do.

Why are there two such identifiers? The ID name was originally intended to be a human-readable name that indicated the function of its sequence. In EMBL and GenBank the first two (or three) letters indicated the species and the rest indicated the function, for example hsfau is the 'Homo sapiens FAU pseudogene'. This naming scheme became a problem when the number of entries added each day grew so vast that people could not make up ID names fast enough. Instead, accession numbers began to be assigned as the ID name as well. You will therefore now find ID names such as AF061303 that are the same as the accession number for that sequence in EMBL.

Most sequence formats include an identifier code in some form or another. Typically this is an accession number and/or identifier name (ID), given near the top of the entry. These codes uniquely identify an entry in the database.

For our EMBL entry, the accession number X56734 is given on the ID line and separately in the AC line:
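As a rough sketch of how these two lines might be read programmatically, the Python function below pulls the ID name and the accession numbers out of EMBL-style text. It assumes a two-letter line code followed by data starting at the sixth character, and semicolon-separated accession numbers on AC lines. The sample lines are illustrative rather than copied from the excerpt, and the function name is invented.

```python
def embl_identifiers(entry_text):
    """Return (id_name, accessions) from the ID and AC lines of one entry."""
    id_name = None
    accessions = []
    for line in entry_text.splitlines():
        # Assumed layout: 2-letter line code, 3 spaces, then the data.
        code, data = line[:2], line[5:]
        if code == "ID" and id_name is None:
            # The ID name is the first token, with or without a trailing ';'.
            tokens = data.replace(";", " ").split()
            if tokens:
                id_name = tokens[0]
        elif code == "AC":
            # Primary accession first, then any secondary accessions.
            accessions.extend(a.strip() for a in data.split(";") if a.strip())
    return id_name, accessions


# Illustrative lines only; the second accession stands in for a
# 'secondary' accession number.
sample = """\
ID   X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP.
XX
AC   X56734; S46826;
"""

id_name, accessions = embl_identifiers(sample)
print(id_name, accessions)
```

Here the ID name and the first accession number coincide, which matches the situation described above where accession numbers are reused as ID names.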

5.2.3.2. Bibliographic Information

Most sequence formats have records for bibliographic information such as literature references, experimental details, author contact information, cross-links to other databases, and much more besides. In the example below, the date of release (DT), a description (DE), keywords (KW), organism species (OS), organism classification (OC) and reference information (RN, RP, RX, RA, RT and RL) are given:

5.2.3.3. Annotation and Features

Most sequence formats have records for descriptions, annotations and comments provided with the sequence. Molecular features associated with the sequence, such as protein secondary structure or molecular recognition sites, are kept in a feature table. These are marked up by FT records in the EMBL entry below.
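Because every EMBL record type is introduced by its own line code (DE, KW, FT and so on), a first pass over an entry can simply group lines by that code. The following Python sketch does this under the assumption of a two-letter code with data from the sixth character onward. The entry text is entirely made up for the example, and the function name is hypothetical.

```python
from collections import defaultdict


def group_by_line_code(entry_text):
    """Map each two-letter line code to the list of its data strings."""
    records = defaultdict(list)
    for line in entry_text.splitlines():
        code = line[:2]
        # Skip blank lines, 'XX' spacers, the '//' terminator and
        # sequence lines (which start with spaces, not a line code).
        if len(code) < 2 or code in ("XX", "//") or not code.strip():
            continue
        records[code].append(line[5:].rstrip())
    return dict(records)


# Invented entry, used only to exercise the function.
sample = """\
ID   EXAMPLE1; SV 1; linear; mRNA; STD; PLN; 100 BP.
XX
DE   Made-up description, first line
DE   continued on a second line
XX
KW   example keyword.
XX
FT   source          1..100
FT   CDS             10..90
//
"""

records = group_by_line_code(sample)
for code, lines in records.items():
    print(code, len(lines))
```

Feature table lines keep their internal spacing, so a later step could split each FT line into a feature key and a location.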

5.2.3.4. The Sequence

Sequences are usually represented in IUBMB standard one-letter codes (see http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html). There are exceptions, for example Staden format uses non-standard ambiguity codes. In the case of FASTA format the sequence is anything after the '>' line until the next entry starts. For other databases, records are used to delineate the sequence.
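A quick way to spot a sequence that departs from the standard codes is to compare its characters against the one-letter nucleotide alphabet. The sketch below is a simple check of that kind, with the alphabet written out by hand (A, C, G, T, U and the ambiguity codes R, Y, S, W, K, M, B, D, H, V, N). Gap characters and protein codes are deliberately not included, and the function name is invented.

```python
# Standard one-letter nucleotide codes, including ambiguity codes.
IUPAC_NUCLEOTIDE = set("ACGTURYSWKMBDHVN")


def non_standard_codes(sequence):
    """Return the sorted list of characters that are not nucleotide codes."""
    return sorted(set(sequence.upper()) - IUPAC_NUCLEOTIDE)


clean = non_standard_codes("acgtnnryk")
suspect = non_standard_codes("ACGT-12X")
print(clean)
print(suspect)
```

An empty result means every character is a standard code; anything else lists the characters that would need attention before the sequence is used.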

In EMBL entries, an SQ label is used to identify the sequence (the full sequence is not given):
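To illustrate how such records delineate the sequence, the Python sketch below collects the residues between an SQ line and the terminating // line. It assumes that sequence lines contain residues in space-separated blocks followed by a position count, and it keeps alphabetic characters only. The entry fragment is made up, and the function name is hypothetical.

```python
def embl_sequence(entry_text):
    """Return the sequence found between the SQ line and the '//' line."""
    in_sequence = False
    residues = []
    for line in entry_text.splitlines():
        if line.startswith("SQ"):
            in_sequence = True          # sequence lines follow this record
        elif line.startswith("//"):
            break                       # end of the entry
        elif in_sequence:
            # Drop spaces and the trailing position count.
            residues.append("".join(ch for ch in line if ch.isalpha()))
    return "".join(residues)


# Invented fragment, for illustration only.
sample = """\
SQ   Sequence 24 BP; 6 A; 6 C; 6 G; 6 T; 0 other;
     acgtacgtac gtacgtacgt acgt                                        24
//
"""

sequence = embl_sequence(sample)
print(len(sequence), sequence)
```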

5.2.4. Specifying Sequences on the Command Line

Sequences are referred to on the EMBOSS command line by their Uniform Sequence Address or USA (Section 6.6, “The Uniform Sequence Address (USA)”). This is a standard sequence naming scheme used by all EMBOSS applications. A USA specifies one or more sequences that might be read from or written to a file or to a larger databank. Other sequence sources, such as applications or web servers, can also be specified.
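The full USA syntax is covered in Section 6.6 and is not reproduced here. As a small illustration only, the Python sketch below handles one common form, in which a format name is prefixed to the rest of the address with a double colon (for example embl::myfile.seq). That form, the file names and the function name are assumptions of this example; other USA forms are left untouched.

```python
def split_usa(usa):
    """Split 'format::rest' into (format, rest).

    Returns (None, usa) unchanged when no 'format::' prefix is present.
    """
    fmt, separator, rest = usa.partition("::")
    if separator:
        return fmt, rest
    return None, usa


with_format = split_usa("embl::myfile.seq")
without_format = split_usa("myfile.seq")
print(with_format)
print(without_format)
```

When no format is given, the format would have to be detected automatically, which is only possible for formats marked "Yes" in the Try column of Table 5.1.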