Overview

Evidence collection

BLAST evidence:

A BLAST homology search against GenBank's NR database produces a set of raw BLAST alignments. Individual alignments derived from the same BLAST hit are linked into a single BLAST cluster. Several overlapping BLAST clusters in a genomic region make up what we call a BLAST locus on the genome assembly. Currently, all BLAST hits with e-values better than (that is, below) 1e-10 are used as BLAST evidence.
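The clustering described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the pipeline's actual code: the `Alignment` record, the function names, and the rule that any coordinate overlap merges two clusters are all hypothetical, and the sketch handles a single contig and ignores strand. Only the 1e-10 cutoff comes from the text.

```python
from collections import defaultdict
from typing import NamedTuple

class Alignment(NamedTuple):
    hit_id: str    # identifier of the NR subject sequence (hypothetical field)
    start: int     # genomic start, inclusive
    end: int       # genomic end, inclusive
    evalue: float

EVALUE_CUTOFF = 1e-10  # threshold stated in the text

def cluster_by_hit(alignments):
    """Link alignments from the same hit into one (start, end, hit_id) span."""
    by_hit = defaultdict(list)
    for a in alignments:
        if a.evalue < EVALUE_CUTOFF:  # keep only hits better than the cutoff
            by_hit[a.hit_id].append(a)
    return [
        (min(a.start for a in group), max(a.end for a in group), hit_id)
        for hit_id, group in by_hit.items()
    ]

def merge_into_loci(clusters):
    """Merge overlapping cluster spans into loci (assumed overlap rule)."""
    loci = []
    for start, end, hit_id in sorted(clusters):
        if loci and start <= loci[-1]["end"]:
            # Overlaps the current locus: extend it and record the hit.
            loci[-1]["end"] = max(loci[-1]["end"], end)
            loci[-1]["hits"].append(hit_id)
        else:
            loci.append({"start": start, "end": end, "hits": [hit_id]})
    return loci
```

Grouping by hit identifier alone would wrongly join matches to the same protein at distant genomic positions, so a real implementation would likely also split clusters by position and strand.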

Pfam domains:

We run HMMER searches using the Pfam and TIGRFAM libraries to find Pfam/TIGRFAM domains in six-frame translations of the genomic sequence.
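For readers unfamiliar with six-frame translation, the following minimal Python sketch shows the idea: the sequence is translated in three reading frames on each strand, and the resulting protein strings are what a domain search would scan. It assumes the standard genetic code and an unambiguous ACGT alphabet (other codons become "X"); the function names and frame labels are illustrative only.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (an assumption;
# some organisms use a different translation table).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )

def six_frame_translations(seq):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    seq = seq.upper()
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames
```

For example, `six_frame_translations("ATGGCCTAA")["+1"]` gives `"MA*"`, where `*` marks a stop codon.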

Gene Models Based on Annotation Transfer from Reference Genome(s)

Well-curated annotations from reference genomes, if available, are transferred to the current genome assembly to improve our automated annotation. Broad's in-house synteny-based gene transfer process has two main steps. First, we find collinear blocks between the two genomes by creating pairwise alignments, and then generate a global alignment for the entire region that the collinear blocks cover. Second, we use an in-house gene mapping program to transfer genes from the reference onto the target genome within the corresponding syntenic blocks, and we use GeneWise to further refine the gene model at each locus.
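As a rough illustration of the first step, the Python sketch below groups alignment anchors into collinear blocks. It is not the in-house program: the anchor representation as (reference position, target position) pairs, the greedy chaining rule, and the `max_gap` parameter are all assumptions, and only forward-orientation blocks are handled.

```python
def collinear_blocks(anchors, max_gap=10_000):
    """Greedily chain (ref_pos, target_pos) anchors into collinear blocks.

    An anchor extends the current block when its target position keeps
    increasing and both gaps stay within max_gap (a hypothetical threshold).
    """
    blocks = []
    for ref, tgt in sorted(anchors):
        if blocks:
            last_ref, last_tgt = blocks[-1][-1]
            if (tgt > last_tgt
                    and ref - last_ref <= max_gap
                    and tgt - last_tgt <= max_gap):
                blocks[-1].append((ref, tgt))
                continue
        # Order broken or gap too large: start a new block.
        blocks.append([(ref, tgt)])
    return blocks
```

Each resulting block defines a region in both genomes that could then be aligned globally and used as the frame for mapping genes from the reference to the target.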

Gene Models Based on BLAST Evidence

Broad's in-house program, "findBlastOrfs", leverages BLASTX alignments to build a complete gene model from the hit. It is particularly useful in low-coverage genomes with frameshifts or gaps in coverage. When ab initio gene predictors encounter an incorrect stop codon produced by a frameshift, they generally produce a truncated prediction at best and no prediction at worst. Furthermore, ab initio tools generally produce inconsistent results when the sequence contains gaps.
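The internals of findBlastOrfs are not described here, so the following Python sketch only illustrates why BLASTX alignments help with frameshifts: when two nearby alignments to the same protein sit on the same strand but in different reading frames, a frameshift in the assembly is a plausible explanation. The `Hsp` record, the function name, and the `max_gap` threshold are hypothetical, and in intron-containing genomes a frame change across a longer gap may simply reflect an intron.

```python
from typing import NamedTuple

class Hsp(NamedTuple):
    hit_id: str  # protein the genomic sequence aligns to
    frame: int   # BLASTX query frame: 1..3 or -1..-3
    start: int   # genomic start, inclusive
    end: int     # genomic end, inclusive

def candidate_frameshifts(hsps, max_gap=30):
    """Return consecutive HSP pairs of one hit that switch frame on one strand."""
    pairs = []
    ordered = sorted(hsps, key=lambda h: (h.hit_id, h.start))
    for prev, cur in zip(ordered, ordered[1:]):
        same_hit = prev.hit_id == cur.hit_id
        same_strand = (prev.frame > 0) == (cur.frame > 0)
        close = cur.start - prev.end <= max_gap  # assumed proximity rule
        if same_hit and same_strand and prev.frame != cur.frame and close:
            pairs.append((prev, cur))
    return pairs
```

A gene builder could join such a pair into one model across the frame change, where an ab initio predictor would stop at the spurious stop codon.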

Selection of Consensus Gene Models

Ab initio predictions, models generated from BLAST hits against NR, transferred reference gene models, and manual gene models are clustered into potential gene loci. At each locus, we select the most likely non-conflicting gene models based on the best evidence available, e.g., Pfam hits, length agreement with the BLAST hits, and overlap with non-coding RNA features. Despite all the progress in the field of gene finding, accurate gene finding on draft genomes is still a challenge. We therefore track likely problematic gene models and tag them with curation flags and notes in the gene report to alert users to the nature of the problems. Manual annotators also use these tags to target bad gene models for manual editing and fine-tuning.
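One simple way to picture the selection step is a greedy choice of the best-supported models that do not overlap each other, as in the Python sketch below. The scoring weights, the `GeneModel` fields, and the greedy strategy are assumptions made for illustration; the text names the evidence types but does not say how they are combined.

```python
from typing import NamedTuple

class GeneModel(NamedTuple):
    name: str
    start: int
    end: int
    has_pfam_hit: bool
    blast_length_ratio: float  # model length / BLAST hit length; 1.0 is ideal
    overlaps_ncrna: bool

def score(m):
    """Combine evidence into one number (weights are hypothetical)."""
    s = 1.0 if m.has_pfam_hit else 0.0
    s += max(0.0, 1.0 - abs(1.0 - m.blast_length_ratio))  # length agreement
    if m.overlaps_ncrna:
        s -= 2.0  # penalize models that overlap non-coding RNA features
    return s

def select_models(models):
    """Pick the highest-scoring models that do not overlap one another."""
    chosen = []
    for m in sorted(models, key=score, reverse=True):
        if all(m.end < c.start or m.start > c.end for c in chosen):
            chosen.append(m)
    return sorted(chosen, key=lambda m: m.start)
```

Models that lose out or score poorly under such a scheme are natural candidates for the curation flags mentioned above.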