Monday, 11 April 2016

Introduction

As of April 2016, there are about 70,000 genome assemblies in Genbank (draft and complete), the majority of them bacterial. For genomes submitted in the NGS era, the COMMENT section of the Genbank file header has machine-readable information about the sequencing technology, depth of coverage, and software used.
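To make the header fields concrete, here is a minimal Python sketch of pulling the "Key :: Value" pairs out of the Genome-Assembly-Data block in a COMMENT section. The function names and the SAMPLE record are made up for illustration, and the sketch assumes each value fits on a single line; it is not the script used for this post.

```python
import gzip

START = "##Genome-Assembly-Data-START##"
END = "##Genome-Assembly-Data-END##"


def read_first_header(path):
    """Return the header lines of the first record in a .gbff.gz file."""
    lines = []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            # The header ends where the feature table begins.
            if line.startswith("FEATURES"):
                break
            lines.append(line.rstrip("\n"))
    return lines


def parse_assembly_data(header_lines):
    """Extract 'Key :: Value' pairs from the structured comment block."""
    fields = {}
    inside = False
    for line in header_lines:
        text = line.strip()
        # The block may start on the COMMENT line itself.
        if text.startswith("COMMENT"):
            text = text[len("COMMENT"):].strip()
        if text == START:
            inside = True
        elif text == END:
            break
        elif inside and "::" in text:
            key, value = text.split("::", 1)
            fields[key.strip()] = value.strip()
    return fields


# Invented example header, not a real Genbank record.
SAMPLE = """\
LOCUS       EXAMPLE_1     5000 bp    DNA     linear   BCT 11-APR-2016
COMMENT     ##Genome-Assembly-Data-START##
            Assembly Method       :: Velvet v. 1.2.10
            Genome Coverage       :: 100x
            Sequencing Technology :: Illumina HiSeq
            ##Genome-Assembly-Data-END##
FEATURES             Location/Qualifiers
""".splitlines()

print(parse_assembly_data(SAMPLE))
```

For a real file, parse_assembly_data(read_first_header(path)) would do the same thing, where path points at a downloaded .gbff.gz file.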

Method

I decided to parse this header for all the bacterial .gbff.gz (GenBank File Format, aka .gbk) files available at the NCBI FTP site to see what genome assembly software is being used for bacterial genomes. Now, like any user-provided information, there is a lot of junk in this field, so I wrote some curated regexps to categorise the entries into cleaner bins. If more than one method was listed, I binned it into Hybrid/Mixed. If it was too minor or probably wrong, I binned it as Could not parse.
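The binning step could look roughly like the Python sketch below. The patterns are hypothetical stand-ins, since the actual curated regexps are not shown here, and only a few bins are covered. As a simplification, anything that matches no pattern goes straight to Could not parse.

```python
import re
from collections import Counter

# Hypothetical patterns for a handful of bins; the real curated list is longer.
PATTERNS = [
    ("AllPaths", re.compile(r"all\s*paths", re.I)),
    ("Newbler", re.compile(r"newbler", re.I)),
    ("Velvet", re.compile(r"velvet", re.I)),
    ("Spades", re.compile(r"spades", re.I)),
]


def bin_assembler(method):
    """Map a free-text assembly method string to one clean bin name."""
    if method is None or not method.strip():
        return "Not provided"
    hits = [name for name, pattern in PATTERNS if pattern.search(method)]
    if len(hits) > 1:
        return "Hybrid/Mixed"
    if len(hits) == 1:
        return hits[0]
    return "Could not parse"


# Invented example inputs, one per outcome.
methods = [
    "Velvet v. 1.2.10",
    "SPAdes v. 3.5; Velvet v. 1.2",
    "ALLPATHS-LG v. 44837",
    None,
    "in-house software",
]

counts = Counter(bin_assembler(m) for m in methods)
for name, count in counts.most_common():
    print(count, name)
```

Matching case-insensitively and allowing optional whitespace absorbs most of the spelling variation; a real pattern list would also need care with overlapping names such as Velvet and VelvetOptimiser.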

Results

Count   Assembler Software
-----   ------------------
23725   Not provided
9883    AllPaths
5325    Newbler
3783    Velvet
3585    CLC Genomics Workbench
3347    Spades
2610    IDBA
2477    Celera Assembler
2082    ABYSS
1815    CLC NGS Cell
1782    SOAPdenovo
1370    Could not parse
1119    HGAP
870     MaSuRCA
853     MIRA
793     A5-MiSeq
308     Ray
149     Phred/Phrap/Consed
132     Geneious
110     SeqMan
109     HGAP3
98      Edena
69      Hybrid/Mixed
59      DNAstar
55      Platanus
53      NextGene
20      Arachne
19      DISCOVAR
9       VelvetOptimiser
5       Falcon
4       Megahit
-----   ------------------
66618   Total

Discussion

I was a little surprised to see AllPaths top the list of named assemblers, given its particular requirements for DNA library construction (overlapping PE + long mate pair), but the Broad Institute does do a lot of sequencing. A lot of people are using Velvet and Spades, but about as many are using CLC Genomics Workbench or the CLC NGS Cell product.

The most disturbing and funniest entries in the Could not parse division are listed below.

At the top we have in-house software (with a version number!). The "Direct Sequencing" entry could be a single perfect read of a full chromosome from a really lucky Oxford Nanopore user. Is there anything Artimist (aka Artemis) cannot do? I need to upgrade my version of Trimmomatic and "actual" BLASTN too.

Conclusion

My main concern is the number of read aligners listed. Others and I have encountered draft genomes where it appears the submitters just aligned the reads to a close reference and submitted the consensus sequence as the assembly. These "genomes" sometimes cause problems in population studies, and I'd rather the reads be available instead.