TXGP ens63 reference

Overview

One of the most interesting questions we can ask of the X. laevis genome is how many genes it has. To construct gene models, we are mainly relying on de novo transcriptome assembly of our RNA-seq data. However, de novo transcriptome assembly programs generate many 'false positive' transcripts. Also, because X. laevis is allotetraploid, the transcriptome data may contain many transcript variants for each gene. So, to estimate gene models precisely from transcriptome data, we would like to group all candidate transcripts for each gene together and analyze each group separately. Sequence-based clustering is a natural way to do this, but we need to optimize parameters such as the %identity used to define a cluster. To get some ideas for this, we looked at the genes and transcripts of several well-studied organisms.

Genes & Transcripts

This figure shows the total number of genes and transcripts in each organism. The number on top of each green bar is the total number of transcripts, and the number on top of each blue bar is the total number of genes (based on the EnsEMBL v.63 annotation). The number on top of each cyan bar is the number of genes that have only one transcript.
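The three counts in the figure can be derived from a transcript-to-gene mapping. The following is a minimal sketch, assuming the annotation has already been reduced to a dictionary of transcript ID to gene ID. The function name `summarize_annotation` and the toy IDs are hypothetical and are not part of our actual scripts.

```python
from collections import Counter


def summarize_annotation(transcript_to_gene):
    """Count transcripts, genes, and genes that have exactly one transcript.

    transcript_to_gene: dict mapping transcript ID -> gene ID
    (an assumed input format, e.g. parsed from an annotation table).
    """
    transcripts_per_gene = Counter(transcript_to_gene.values())
    return {
        "transcripts": len(transcript_to_gene),
        "genes": len(transcripts_per_gene),
        "single_transcript_genes": sum(
            1 for n in transcripts_per_gene.values() if n == 1
        ),
    }


# Toy mapping with made-up IDs: G1 has 2 transcripts, G2 has 1, G3 has 3.
example = {"T1": "G1", "T2": "G1", "T3": "G2", "T4": "G3", "T5": "G3", "T6": "G3"}
summary = summarize_annotation(example)
print(summary)
```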

Clustering of transcripts

We clustered all transcripts of each organism with the usearch program at different %id cutoffs. The number on top of each red bar is the ratio of 'the number of clusters' to 'the number of genes'. The number on top of each pink bar is the ratio of 'the number of clusters containing transcripts from more than one gene' to 'the number of clusters'. Although such clusters may simply contain very close paralogous genes, we counted this number as 'clustering error'.
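The two ratios described above can be computed from cluster assignments as follows. This is only a sketch: it assumes the clustering output has already been parsed into a dictionary of transcript ID to cluster ID, and the function name `cluster_metrics` and the toy IDs are hypothetical.

```python
from collections import defaultdict


def cluster_metrics(transcript_to_cluster, transcript_to_gene):
    """Return (clusters / genes, mixed clusters / clusters).

    A 'mixed' cluster contains transcripts from more than one gene;
    its fraction is what we call 'clustering error'.
    """
    genes_in_cluster = defaultdict(set)
    for transcript, cluster in transcript_to_cluster.items():
        genes_in_cluster[cluster].add(transcript_to_gene[transcript])

    n_clusters = len(genes_in_cluster)
    n_genes = len(set(transcript_to_gene.values()))
    n_mixed = sum(1 for genes in genes_in_cluster.values() if len(genes) > 1)
    return n_clusters / n_genes, n_mixed / n_clusters


# Toy data with made-up IDs: 3 genes, 3 clusters, and cluster C1 mixes G1 and G2.
genes = {"T1": "G1", "T2": "G1", "T3": "G2", "T4": "G3", "T5": "G3"}
clusters = {"T1": "C1", "T2": "C1", "T3": "C1", "T4": "C2", "T5": "C3"}
ratio, error = cluster_metrics(clusters, genes)
print(ratio, error)
```

In this toy case the cluster count happens to match the gene count even though one cluster is mixed and one gene is split, which is why we look at both numbers together.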

The ultimate goal is to make the number of clusters match the number of genes in all organisms.

Human and mouse have far more transcripts than the other organisms, so for these two we would allow up to 1.5x as many clusters as the total number of genes.

Although X. tropicalis is the closest model organism, it does not have many annotated transcripts yet. So we use D. rerio (zebrafish) to estimate the 'optimal number of clusters'.

We would like to keep 'clustering error' below 0.10. This is a little higher than the conventional cutoff of 0.05, but, as mentioned earlier, the mixed clusters may include many paralogous genes, so it is unlikely that all of them are clustering errors.

With these criteria, we determined '%id>0.80' to be the optimal cutoff for transcript clustering for gene model estimation. This does not mean that we discarded every sequence in a cluster except one representative sequence; we only grouped them for further analysis.
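One way to express these criteria in code is sketched below. It is an interpretation, not the exact procedure we followed: it keeps only cutoffs whose 'clustering error' is below the limit, then picks the one whose cluster-to-gene ratio is closest to 1. The function name `pick_cutoff` is hypothetical, and the numbers in `placeholder` are invented for illustration and are not the measured values from the figures.

```python
def pick_cutoff(metrics, max_error=0.10, target_ratio=1.0):
    """Pick a %id cutoff from {cutoff: (cluster/gene ratio, clustering error)}.

    Assumed rule: require error < max_error, then choose the cutoff whose
    ratio is closest to target_ratio. Returns None if nothing qualifies.
    """
    acceptable = {
        cutoff: ratio
        for cutoff, (ratio, error) in metrics.items()
        if error < max_error
    }
    if not acceptable:
        return None
    return min(acceptable, key=lambda c: abs(acceptable[c] - target_ratio))


# PLACEHOLDER values only, not measured results.
placeholder = {
    0.70: (0.85, 0.15),  # rejected: error is not below 0.10
    0.80: (1.02, 0.08),
    0.90: (1.30, 0.04),
}
chosen = pick_cutoff(placeholder)
print(chosen)
```

For human and mouse, the same function could be called with `target_ratio=1.5` to reflect the looser allowance described above, although treating 1.5 as a target rather than an upper bound is again an assumption.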

Length of transcripts

We also looked at the distribution of transcript lengths in the same data. Except for a few genes in C. elegans and D. melanogaster, no gene produces transcripts longer than 10 kbp, so we could consider an assembled transcript longer than this to be a false positive. Also, more than 90% of transcripts are longer than 500 bp in every organism except human ('bottom10pct' below; maybe because many short transcripts are annotated on the human genome), so we may discard assembled transcripts shorter than 500 bp if they make up much more than 10% of the total transcripts.
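The two length rules could be applied to an assembly roughly as follows. This is a sketch under stated assumptions: transcript lengths are given as a dictionary of name to length in bp, "much more than 10%" is simplified to "more than 10%", and the function name `filter_by_length` and the toy lengths are hypothetical.

```python
def filter_by_length(lengths, min_len=500, max_len=10_000, max_short_fraction=0.10):
    """Drop likely false positives from {transcript name: length in bp}.

    - Transcripts longer than max_len are always dropped.
    - Transcripts shorter than min_len are dropped only if they make up
      more than max_short_fraction of all transcripts (assumed threshold).
    """
    if not lengths:
        return {}

    n_short = sum(1 for n in lengths.values() if n < min_len)
    drop_short = n_short / len(lengths) > max_short_fraction

    kept = {}
    for name, n in lengths.items():
        if n > max_len:
            continue
        if drop_short and n < min_len:
            continue
        kept[name] = n
    return kept


# Toy lengths: 2 of 5 transcripts are under 500 bp (40%), so short ones are dropped.
toy = {"a": 300, "b": 450, "c": 800, "d": 2500, "e": 12000}
kept = filter_by_length(toy)
print(sorted(kept))
```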