Abstract

The identification and characterization of the complete ensemble of genes is a major goal in deciphering the digital information stored in the human genome. Many algorithms for computational gene prediction have been described, ultimately derived from two basic concepts: (1) modeling gene structure and (2) recognizing sequence similarity. Successful hybrid methods combining these two concepts have also been developed. We present a third, orthogonal approach to gene prediction, based on detecting the genomic signatures of transcription accumulated over evolutionary time. We discuss four algorithms based on this third concept: Greens and CHOWDER, which quantify mutational strand biases caused by transcription-coupled DNA repair, and ROAST and PASTA, which are based on strand-specific selection against polyadenylation signals. We combined these algorithms into an integrated method called FEAST, which we used to predict the location and orientation of thousands of putative transcription units not overlapping known genes. Many of the newly predicted transcriptional units do not appear to code for proteins. The new algorithms are particularly adept at detecting genes that have long introns and lack sequence conservation. They therefore complement existing gene prediction methods and will help identify functional transcripts within many apparent "genomic deserts."

Conflict of interest statement

Competing interests. The authors have declared that no competing interests exist.

Figures

Figure 1. Information Flow in FEAST

The genomic sequence is analyzed using RepeatMasker, yielding a masked sequence (studied for its base composition), a repeat table, and an alignment file, which is used to list mutations in repeats and to produce a “sequence mask.” Both the original sequence and the sequence mask are studied using polyadq, yielding tables of predicted polyadenylation signals (PASs). The nucleotide composition of the unique sequence and the mutations within repeats are tabulated as well. The tables are then analyzed to calculate skews, which are finally used to produce predictive scores, separately for each method (Greens, ROAST, CHOWDER, and PASTA) or in combination (FEAST).
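The "calculate skews" step for base composition can be illustrated with a minimal sketch. The caption does not give the formula, so the normalized differences (T − A)/(T + A) and (G − C)/(G + C), the non-overlapping windows, the window size, and the function name `strand_skews` are all assumptions made for illustration, not the published Greens definition. The sketch also assumes that masked bases appear as `N` or as lowercase letters and should be skipped.

```python
from collections import Counter


def strand_skews(seq, window=10_000):
    """Yield (start, ta_skew, gc_skew) for consecutive non-overlapping windows.

    Only uppercase A, C, G, T are counted, so hard-masked (N) and
    soft-masked (lowercase) bases are ignored. This is an assumed
    convention, not one stated in the source.
    """
    for start in range(0, len(seq), window):
        counts = Counter(seq[start:start + window])
        a, c, g, t = (counts[base] for base in "ACGT")
        # A skew is undefined when the window contains neither base of the pair.
        ta_skew = (t - a) / (t + a) if (t + a) else None
        gc_skew = (g - c) / (g + c) if (g + c) else None
        yield start, ta_skew, gc_skew


if __name__ == "__main__":
    for row in strand_skews("TTTAGGGC" + "NNNN" + "acgt", window=8):
        print(row)
```

A consistent sign of such skews across adjacent windows is the kind of strand asymmetry that a footprint-based method could convert into a per-strand score; how the scores are actually derived is not specified here.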

Figure 2. FEAST Reanalysis of Known Genes

Scatterplot of FEAST scores versus gene length for known genes from the UCSC Genome Bioinformatics Site [20]. Genes overlapping known genes on the complementary strand were excluded. Scores greater than 3 are considered significant.

Figure 3. FEAST Reanalysis of Existing Annotation

Success rates for FEAST reanalysis of known genes (top left), experimental gene annotations (center and bottom left), and gene predictions (right). Gene annotations were stratified by length into three classes: short (<10 kb), medium (10 to 100 kb), and long (>100 kb); the number of genes in each class is given above each bar. FEAST scores were stratified into three groups: nonsignificant (white, −2 < Z < 2), significant for the expected strand (shades of brown, Z > 2), and significant for the wrong strand (shades of red, Z < −2). The Z < −4 and Z > 4 bins include potentially large values, as displayed in Figure 2. Columns labeled with asterisks include the gene regions longer than 100 kb remaining after subtraction of overlaps with known genes, on which FEAST had been trained.
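The stratification described in this caption can be sketched as a small tally. The thresholds are taken from the caption; the function names, the `(length_bp, z)` input format, and the treatment of values falling exactly on a boundary (10 kb, 100 kb, Z = ±2) are assumptions, since the caption leaves them unspecified.

```python
from collections import Counter


def length_class(length_bp):
    """Classify a gene by length: short (<10 kb), medium (10-100 kb), long (>100 kb)."""
    if length_bp < 10_000:
        return "short"
    if length_bp <= 100_000:  # assumption: exactly 100 kb counts as medium
        return "medium"
    return "long"


def score_class(z):
    """Classify a Z score relative to the annotated strand."""
    if z > 2:
        return "expected strand"
    if z < -2:
        return "wrong strand"
    return "nonsignificant"  # assumption: Z of exactly +2 or -2 is nonsignificant


def tally(genes):
    """Count genes per (length class, score class); genes is an iterable of (length_bp, z)."""
    return Counter((length_class(length), score_class(z)) for length, z in genes)


if __name__ == "__main__":
    # Hypothetical values for illustration only, not data from the study.
    example = [(5_000, 3.1), (50_000, -2.5), (250_000, 0.4), (250_000, 4.2)]
    for key, count in sorted(tally(example).items()):
        print(key, count)
```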

Figure 4. FEAST Scores at Gene Boundaries

The average FEAST scores for known genes (thick black, n = 10,023), aligned at the position of gene start, show a sharp shift from nonsignificant values (near 0) outside the gene, to significant values at the 5′ end of the gene. The opposite shift is seen at the gene end, although it is more gradual. RNA cluster sequences (thin red, n = 13,749) show a very similar graph. Twinscan predictions (dashed green, n = 9,131) display positive FEAST scores outside the predicted regions, suggesting an underprediction of gene ends, particularly toward the 5′ end. Known genes, RNA clusters, and Twinscan predictions shorter than 20 kb were excluded from this analysis.

Figure 5. Genomewide Comparison of Gene Annotations

The matrix of disagreement measures for all pairs of annotation methods is represented by points in two dimensions using multidimensional scaling (MDS). Filled black circles represent experimentally observed transcripts, the vast majority being in the “RNA” set. Triangles represent methods involving significant manual curation and/or based on the RNA set. “S,” “H,” and “F” represent methods based on gene structure prediction, hybrid methods (gene structure and sequence similarity), and methods measuring footprints of transcription, respectively. The combined FEAST method was excluded from the MDS analysis, and its projected location (squared F) was calculated later (see Materials and Methods). Note that, like geographical maps of intercity distances, MDS representations have no axes.
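How a pairwise disagreement matrix becomes a two-dimensional map can be sketched with classical (Torgerson) MDS. The caption does not say which MDS variant was used, so this choice, the function name `classical_mds`, and the use of NumPy are assumptions; the intercity-style example uses three hypothetical points rather than the study's annotation sets.

```python
import numpy as np


def classical_mds(d, dims=2):
    """Embed n items in `dims` dimensions from an n x n matrix of pairwise distances.

    Classical MDS: double-center the squared distances, then scale the
    leading eigenvectors by the square roots of their eigenvalues.
    """
    d = np.asarray(d, dtype=float)
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centering @ (d ** 2) @ centering
    vals, vecs = np.linalg.eigh(b)           # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:dims]    # indices of the largest eigenvalues
    # Non-Euclidean dissimilarities can give negative eigenvalues; clip them to 0.
    top_vals = np.clip(vals[order], 0.0, None)
    return vecs[:, order] * np.sqrt(top_vals)


if __name__ == "__main__":
    # Hypothetical example: three points forming a 3-4-5 right triangle.
    pts = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    print(classical_mds(dist))
```

The recovered coordinates reproduce the input distances only up to rotation, reflection, and translation, which is why such maps have no meaningful axes.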

Figure 6. CPHL1, a Novel Ceruloplasmin-Like Gene

Standard UCSC Genome Browser view of the CP locus showing a 90-kb “desert” separating it from the next known gene, LOC116441, and GESTALT view of the same locus, indicating the extent of the transcribed region predicted by ROAST (red bar in ROAST track) and the predicted gene structure for CPHL1. Interspersed repeats are color-coded, with red, green, pink, and brown bars representing Alu, MIR, LINE, and other repeats, respectively, and bar height indicating repeat age (younger repeats are taller); the megabase scale starts at the p telomere. The newly discovered gene overlaps with a gene structure predicted by Twinscan (chr3.151.005.a) but shares only seven of 21 exons, one imprecisely. GenScan predicts a much longer structure continuous with the CP gene, sharing 14 exons with CPHL1, of which ten are precisely predicted. Inset: Phylogenetic analysis of the CP/CPHL1 family rooted using the hephaestin protein sequence as outgroup. Numbers above branches represent percentage bootstrap support over 1,000 replicates; the horizontal bar indicates 10% divergence along each branch.

Figure 7. GESTALT View of the AGBL1 Locus between the AKAP13 and NTRK3 Genes on Human Chromosome 15, 84.1 to 86.1 Mb from the p Telomere

PASTA, Greens, CHOWDER, and FEAST predictions are displayed for each strand in brown, green, pink, and red, respectively, with lighter shades indicating less significant scores. In the FEAST track, actual scores are indicated in red, and maximal segments are displayed in blue. The AGBL1 gene structure was modeled based on translated sequence similarity to the AGTPBP1 protein.
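The relationship between per-position scores and a "maximal segment" can be illustrated with a simplified sketch. The caption does not define how maximal segments are computed, so the following is an assumption: it finds only the single contiguous run of scores with the largest sum (Kadane's algorithm), whereas the published method may report several segments or use a different criterion. The function name `best_segment` and the example scores are hypothetical.

```python
def best_segment(scores):
    """Return (start, end, total) for the contiguous run with the largest sum.

    `end` is exclusive, so the run is scores[start:end].
    """
    if not scores:
        raise ValueError("scores must not be empty")
    best = (0, 1, scores[0])
    run_start, run_sum = 0, 0
    for i, score in enumerate(scores):
        if run_sum <= 0:
            # A non-positive running sum cannot help; restart the run here.
            run_start, run_sum = i, score
        else:
            run_sum += score
        if run_sum > best[2]:
            best = (run_start, i + 1, run_sum)
    return best


if __name__ == "__main__":
    # Hypothetical per-window scores for one strand.
    print(best_segment([-1, 2, 3, -4, 5, -1]))
```

Under this simplification, short dips in score inside a transcribed region are bridged as long as the surrounding positive scores outweigh them.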

Figure 8. The Highest-Scoring Novel Predicted Transcript, LOC401237

VISTA and GESTALT analyses of the LOC401237 locus, showing sequence conservation with the mouse, chicken, and frog orthologous loci; the observed intron-exon structure of LOC401237 and location of neighboring genes, with black circles representing CpG islands; the integrated FEAST scores for the forward (+) and reverse (−) strands, with the black arrow representing the calculated maximal segment; the repeat distribution on both strands, with red, green, pink, and brown bars, respectively, representing Alu, MIR, LINE, and other repeats, and bar height indicating repeat age (younger repeats are taller); the megabase scale, range 21.7 to 22.4 Mb from the p telomere. Inset on top: Detail on the conserved intronic noncoding sequences, between two nonconserved exons.

Figure 9. A Third Basic Concept

By studying various sources of sequence information (pink boxes), genes have been identified using a variety of computational methods based on the identification of gene structure and/or the identification of sequence conservation. The FEAST methods represent a third basic concept, in which sustained transcriptional activity is inferred by its mutational and selective effects on the genomic sequence, the “transcriptional footprints.” Light blue boxes indicate the three basic concepts for gene prediction. The dashed vertical line separates gene prediction (to the left), from gene identification (to the right): the latter is based on the analysis of sequences expressed from the same locus.