
How To Format Sequence Data For GenBank Submissions

Submitting sequences to GenBank can seem complicated at first, but starting with a solid foundation in the form of a properly formatted file will make the process go smoothly.

Before you submit sequence data to GenBank, the data must be formatted correctly. The most common file format is FASTA. This post will show you how to create a FASTA file for submitting a single nucleotide sequence or a set of multiple nucleotide sequences.

The image below depicts a single sequence in FASTA format. For multiple sequences, such as those of population or phylogenetic studies, environmental samples, and batch sequences of the same gene, create the file using the steps below and put the set of sequences together in a single FASTA file.

Here is how to create the FASTA file:

1) We strongly recommend that you use a text editor. If you use a word processing program, you must save the file as plain ASCII text in order to retain the FASTA format.

2) Create a short, unique sequence ID (SeqID) for each sequence. The SeqIDs function as placeholders until GenBank assigns accession numbers to replace them.

The following is an example of a good SeqID: 1234_abc

You can also use a unique isolate number, unique clone number, or other simple unique IDs.

Please limit the SeqID to 25 characters or fewer, and do not use square brackets (“[]”) in the SeqID.

3) Type the greater-than symbol (>) followed immediately by the SeqID, then press the SPACE key on your keyboard. A single space between the SeqID and the [organism=Genus species] information is required for Sequin or BankIt to read the FASTA file.

Example:

>Seq_123 [organism=Homo sapiens] [isolate=456]

4) Use square brackets around the formatted organism data like this: [organism=Genus species]
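
If you are preparing many sequences, the rules in steps 2 through 4 can be applied with a short script instead of typing each line by hand. The Python sketch below is only an illustration, not an NCBI tool: the function names (validate_seqid, make_defline, make_fasta) are hypothetical, and the check that rejects whitespace inside a SeqID is our assumption, based on the space that separates the SeqID from the bracketed information.

```python
def validate_seqid(seqid: str) -> None:
    """Raise ValueError if the SeqID breaks the rules described in step 2."""
    if not seqid:
        raise ValueError("SeqID is empty")
    if len(seqid) > 25:
        raise ValueError(f"SeqID {seqid!r} is longer than 25 characters")
    if "[" in seqid or "]" in seqid:
        raise ValueError(f"SeqID {seqid!r} contains square brackets")
    # Assumption (not stated in the post): whitespace inside the SeqID is
    # rejected, because a space marks where the SeqID ends.
    if any(ch.isspace() for ch in seqid):
        raise ValueError(f"SeqID {seqid!r} contains whitespace")


def make_defline(seqid: str, organism: str, **modifiers: str) -> str:
    """Build '>SeqID [organism=Genus species] [key=value] ...'."""
    validate_seqid(seqid)
    parts = [f"[organism={organism}]"]
    parts += [f"[{key}={value}]" for key, value in modifiers.items()]
    # Exactly one space separates the SeqID from the first bracketed item.
    return f">{seqid} " + " ".join(parts)


def make_fasta(records) -> str:
    """Combine (seqid, organism, modifiers, sequence) tuples into one FASTA text."""
    seen = set()
    lines = []
    for seqid, organism, modifiers, sequence in records:
        if seqid in seen:
            raise ValueError(f"SeqID {seqid!r} is not unique")
        seen.add(seqid)
        lines.append(make_defline(seqid, organism, **modifiers))
        lines.append(sequence)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder sequences for illustration only.
    records = [
        ("Seq_123", "Homo sapiens", {"isolate": "456"}, "ATGCATGCATGC"),
        ("Seq_124", "Homo sapiens", {"isolate": "457"}, "ATGCATGGATGC"),
    ]
    print(make_fasta(records), end="")
```

To keep the result as plain ASCII text, as step 1 requires, you could save the returned string with open("sequences.fsa", "w", encoding="ascii"), where the file name is just an example.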