Tuesday, 24 July 2012

Navigating microbial genomes on the NCBI FTP site

If you click on the URL, there is a big list of folders, and it does look like a mess. But for those of us in microbial genomics there are a few key folders you should know about, and probably even have mirrored on your own servers:

Most of my work is in bacterial genomics, so I'll discuss the contents of the first four folders only. I'll leave the last two to an experienced mycogenomicist.

1. Bacteria

This directory contains a folder for each completed bacterial genome. That is, the genome has been finished to a single DNA sequence per replicon (usually just one chromosome) and is fully annotated. There are currently around 1000 completed bacterial genomes, of which I've been involved in about 10.

You can see a bunch of files, all with the same prefix (NC_014500) and a range of different suffixes or file extensions (.gbk, .gff), some of which should be familiar to you. NC_014500 is the RefSeq accession ID for the single chromosome of Dickeya dadantii. The most important files are:

In terms of usefulness, the .gbk file contains (nearly) all the information that the other files contain - the .faa and .fna files are easily generated from the .gbk using BioPerl etc. If you want to get the .gbk files for all the finished genomes, you can download the tarball NCBI provides: ftp://ftp.ncbi.nih.gov/genomes/Bacteria/all.gbk.tar.gz
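To illustrate how the sequence in a .gbk file maps onto a .fna file, here is a minimal Python sketch using only the standard library. It assumes a well-formed GenBank flat file with a LOCUS line and an ORIGIN block terminated by "//". The function name gbk_to_fasta and the toy record are made up for this example, and it only handles the nucleotide sequence, not the protein translations needed for a .faa file. For real work a proper parser such as BioPerl is the safer choice.

```python
def gbk_to_fasta(gbk_text, width=60):
    """Convert GenBank flat-file text to nucleotide FASTA, one entry per record."""
    entries = []
    name, chunks, in_origin = None, [], False
    for line in gbk_text.splitlines():
        if line.startswith("LOCUS"):
            # The second whitespace-separated token is the locus name.
            name = line.split()[1]
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            # End of record: emit whatever sequence was collected.
            if name is not None:
                seq = "".join(chunks).upper()
                body = "\n".join(seq[i:i + width] for i in range(0, len(seq), width))
                entries.append(f">{name}\n{body}\n")
            name, chunks, in_origin = None, [], False
        elif in_origin:
            # Sequence lines look like "        1 atggctagct agctaggcta":
            # drop the leading coordinate and keep the base blocks.
            chunks.extend(line.split()[1:])
    return "".join(entries)


# Toy record (hypothetical, not a real accession).
TOY_GBK = (
    "LOCUS       TOY_0001                  24 bp    DNA     linear   BCT 24-JUL-2012\n"
    "ORIGIN\n"
    "        1 atggctagct agctaggcta gtaa\n"
    "//\n"
)

print(gbk_to_fasta(TOY_GBK), end="")
```

Because records are split on the "//" terminator, the same function should also cope with a file holding several records back to back.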

2. Bacteria_DRAFT

This directory contains folders for each draft bacterial genome. That is, the genome has been de novo assembled into contigs/scaffolds (eg. using Newbler for 454 data) but has not been, and probably never will be, finished. They are usually annotated, either by the submitter or automatically by NCBI, but sometimes there may be only sequences. There are about 2600 draft genomes currently.

Here's the contents of the Thiocapsa marina str. 5811 genome folder - it's a purple sulphur coccus from the Mediterranean Coast if you are interested.

NZ_AFWV00000000.asn               13.5 kB   03/04/2012 03:19:00
NZ_AFWV00000000.contig.asn.tgz    1.7 MB    21/07/2012 02:13:00
NZ_AFWV00000000.contig.faa.tgz    1.0 MB    21/07/2012 02:13:00
NZ_AFWV00000000.contig.ffn.tgz    1.5 MB    21/07/2012 02:13:00
NZ_AFWV00000000.contig.fna.tgz    1.6 MB    21/07/2012 02:13:00
NZ_AFWV00000000.contig.frn.tgz    4.1 kB    21/07/2012 02:13:00
NZ_AFWV00000000.contig.gbk.tgz    4.6 MB    21/07/2012 02:13:00
NZ_AFWV00000000.contig.gbs.tgz    4.2 kB    21/07/2012 02:13:00
NZ_AFWV00000000.contig.gff.tgz    393 kB    21/07/2012 02:13:00
NZ_AFWV00000000.contig.ptt.tgz    119 kB    21/07/2012 02:13:00
NZ_AFWV00000000.contig.rnt.tgz    1.5 kB    21/07/2012 02:13:00
NZ_AFWV00000000.contig.rpt.tgz    2.5 kB    21/07/2012 02:13:00
NZ_AFWV00000000.contig.val.tgz    1.6 MB    21/07/2012 02:13:00
NZ_AFWV00000000.gbk               4.7 kB    03/04/2012 03:19:00
NZ_AFWV00000000.rpt               257 B     03/04/2012 03:19:00
NZ_AFWV00000000.val               6.0 kB    03/04/2012 03:19:00

This folder looks a bit different to the finished genomes. It has a .gbk file, but you will notice it is quite small (4700 bytes), and if you look at it, you can see it has no sequence or annotation, only some meta-data and a reference to "WGS NZ_AFWV01000001-NZ_AFWV01000062". This means that this genome record consists of 62 other records, one for each contig in the assembly. These are stored in the compressed tar file NZ_AFWV00000000.contig.gbk.tgz as follows:

% tar ztf NZ_AFWV00000000.contig.gbk.tgz
NZ_AFWV01000001.gbk
NZ_AFWV01000002.gbk
NZ_AFWV01000003.gbk
...
NZ_AFWV01000061.gbk
NZ_AFWV01000062.gbk
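The contig accessions can also be worked out from the WGS range in the master record without opening the tarball. Here is a small Python sketch, assuming the range is always written as two accessions joined by a hyphen, sharing a prefix and a fixed-width number. The function name expand_wgs_range is made up for this example.

```python
import re


def expand_wgs_range(wgs_range):
    """Expand 'PREFIXnnnn-PREFIXmmmm' into the full list of accessions."""
    first, last = wgs_range.strip().split("-")
    # Split each accession into a prefix and its trailing run of digits.
    m_first = re.fullmatch(r"(.*\D)(\d+)", first)
    m_last = re.fullmatch(r"(.*\D)(\d+)", last)
    if not m_first or not m_last or m_first.group(1) != m_last.group(1):
        raise ValueError(f"unexpected WGS range: {wgs_range!r}")
    prefix = m_first.group(1)
    width = len(m_first.group(2))  # keep the zero padding
    start, end = int(m_first.group(2)), int(m_last.group(2))
    return [f"{prefix}{n:0{width}d}" for n in range(start, end + 1)]


contigs = expand_wgs_range("NZ_AFWV01000001-NZ_AFWV01000062")
print(len(contigs), contigs[0], contigs[-1])
```

For the Thiocapsa marina range this gives the 62 accessions, from NZ_AFWV01000001 to NZ_AFWV01000062.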

So, in summary, instead of getting a nice neat single .gbk or .faa file for each replicon as you do for the completed genomes, you get a tarball of files for each assembly, with each file representing a contig in the draft genome. Any extra chromosomes or plasmids will be mixed in the bag of contigs.
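If you would rather have one multi-record file per draft genome, the per-contig files in the tarball can be joined back together. The following Python sketch uses only the standard tarfile module. The function name concat_tar_members is hypothetical, and it assumes the members are plain ASCII text files whose sorted names give a sensible contig order.

```python
import io
import tarfile


def concat_tar_members(fileobj, suffix=".gbk"):
    """Return (names, text): the matching member names in sorted order,
    and their contents concatenated in that order."""
    names, parts = [], []
    with tarfile.open(fileobj=fileobj, mode="r:gz") as tar:
        for member in sorted(tar.getmembers(), key=lambda m: m.name):
            if member.isfile() and member.name.endswith(suffix):
                names.append(member.name)
                parts.append(tar.extractfile(member).read().decode("ascii"))
    return names, "".join(parts)


def make_demo_tgz(files):
    """Build a tiny in-memory .tgz so the example runs without a download."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in files:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


# Stand-in contents; a real tarball would hold full GenBank records.
demo = make_demo_tgz([("c2.gbk", b"LOCUS c2\n//\n"), ("c1.gbk", b"LOCUS c1\n//\n")])
names, text = concat_tar_members(demo)
print(names)
```

For a downloaded file you would pass an open binary handle instead, eg. open("NZ_AFWV00000000.contig.gbk.tgz", "rb").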

3. Plasmids

The Plasmids folder is not known to many people; frankly, it seems a bit hidden away. It contains ~3000 completed plasmid sequences. Confusingly, ~1000 of these are duplicated from the Bacteria folder (as the plasmid was sequenced with its parent), while the other ~2000 are novel. Even more annoying is that the folder structure is different:

faa/                                 21/07/2012 19:39:00
fna/                                 21/07/2012 19:40:00
gbk/                                 21/07/2012 19:41:00
...
plasmids.all.faa.tar.gz   43.2 MB    23/07/2012 19:43:00
plasmids.all.fna.tar.gz   75.1 MB    23/07/2012 19:43:00
plasmids.all.gbk.tar.gz   199 MB     23/07/2012 19:43:00
...

Now we have a folder for each file extension, each of which contains ~3000 files. So the files for a particular plasmid are spread out over multiple folders. Fortunately, NCBI provides compressed tar files of the whole archive to download directly: plasmids.all.gbk.tar.gz
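If you want the per-plasmid view back, the files can be regrouped by accession once you have a list of paths. Here is a minimal Python sketch. It assumes each file is named accession.extension inside its extension folder; the accessions below are placeholders rather than real plasmid records, and group_by_accession is a made-up name.

```python
import posixpath
from collections import defaultdict


def group_by_accession(paths):
    """Map accession -> {extension: path} for paths like 'gbk/NC_000000.gbk'."""
    grouped = defaultdict(dict)
    for path in paths:
        # Strip the folder, then split the file name into accession and extension.
        stem, ext = posixpath.splitext(posixpath.basename(path))
        grouped[stem][ext.lstrip(".")] = path
    return dict(grouped)


# Placeholder accessions for illustration only.
paths = [
    "faa/NC_000000.faa",
    "fna/NC_000000.fna",
    "gbk/NC_000000.gbk",
    "gbk/NC_999999.gbk",
]
for accession, files in sorted(group_by_accession(paths).items()):
    print(accession, sorted(files))
```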

4. Viruses

Some of you may be wondering why I am including Viruses in this story. Well, some viruses infect Bacteria too - they are called bacteriophage. There are ~3000 folders in the Viruses division, but not all of them are bacteriophage. A simple grep for "phage" suggests ~600 are bacterial viruses. The folder structure is the same as for the finished Bacteria genomes.
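The "simple grep" amounts to a substring filter on the folder names. A rough Python equivalent is sketched below. It matches case-insensitively (like grep -i), and the folder names are illustrative, not copied from the FTP site. Matching on the name alone will miss any bacterial virus whose folder name does not contain "phage", so treat the count as an estimate.

```python
def find_phage_folders(folder_names):
    """Keep folder names containing 'phage', ignoring case."""
    return [name for name in folder_names if "phage" in name.lower()]


# Illustrative folder names only.
folders = [
    "Enterobacteria_phage_lambda",
    "Tobacco_mosaic_virus",
    "Bacillus_Phage_example",
    "Hepatitis_B_virus",
]
phages = find_phage_folders(folders)
print(len(phages), phages)
```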

It is important to realise that most of these virus sequences are natively dsDNA and will also appear integrated into the chromosomal DNA of many of the entries in Bacteria and Bacteria_DRAFT.

I don't know why the DRAFT folder is set out differently to the finished genomes. One reason for the lack of all.*.tar.gz files could be that there are 4x as many draft genomes and the files are just too big for people to reliably download.

You could use an FTP client that allows wildcards (eg. ncftp) so you can do "mget */*.fna" in the folder.

Hello, I'm using the Salmonella enterica directory, but it contains many sub-directories with incomprehensible names which variously contain the .gbk etc files. All the sub-folders no doubt represent the hundreds of Salmonella enterica serovars, but I don't understand how I'm supposed to navigate to the correct folder. I can't work out what the system is here. Do you have any idea? I would be very grateful for any light you could shed!

The trouble is that the folder names are incomprehensible (at least to me) e.g. GCF_000006945, and that many of the folders are empty.

For example, I am looking for Salmonella Typhimurium strain SL1344 (RefSeq: NC_016810), but I have no idea which of the 500 or so folders all named "GCF_000****" to look in... Do you know what these folder names mean? (Here is the link to the directory I am talking about in case it helps... ftp://ftp.ncbi.nih.gov/genomes/ASSEMBLY_BACTERIA/Salmonella_enterica/ )

Hello, do you have any idea why there are multiple gbk files for the same strain in the same folder? For example, the Acetobacter_pasteurianus_386B_uid214433 folder has many gbk files in it. I am writing software that downloads specific bacterial genomes and parses them to store data in a database.

Hi. Keep up the great work, this blog has been really helpful. I'm new to the whole bioinformatics thing, so I am stumbling around in the dark a bit here. Could you tell me what the ASSEMBLY_BACTERIA folder is about?

Not that I know of. All bacteria are probably pathogenic, depending on the environment they are placed in. eg. some hurt plants but not animals etc. Some are fine on our skin, but bad in our bloodstream.