GeMoMa

Gene Model Mapper (GeMoMa) is a homology-based gene prediction program. GeMoMa uses the annotation of protein-coding genes in a reference genome to infer the annotation of protein-coding genes in a target genome. To do so, GeMoMa exploits amino acid and intron position conservation. In addition, GeMoMa can incorporate RNA-seq evidence for splice site prediction.

Galaxy

GeMoMa is available on a public web server at galaxy.informatik.uni-halle.de. The provided web server only allows a limited number of reference genes and uses a timeout of 2 minutes per transcript prediction. For unlimited use, please use the command line program or integrate GeMoMa in your own Galaxy instance.

GeMoMa workflow adapted from Galaxy

Running the command line application

For running the command line application, Java v1.8 or later is required.
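Whether the installed Java is recent enough can be checked before running any tool. The following is a minimal sketch, not part of GeMoMa: the helper name java_major is hypothetical, and it only assumes the two common version formats, the legacy 1.8.0_x style and the newer 11.0.x style.

```shell
# java_major VERSION: print the major Java version for a version string.
# Legacy strings such as 1.8.0_292 map to 8; newer ones such as 17.0.2 map to 17.
java_major() {
    case "$1" in
        1.*) echo "$1" | cut -d. -f2 ;;
        *)   echo "$1" | cut -d. -f1 ;;
    esac
}

# Usage with the local installation (java prints its version to stderr):
#   v=$(java -version 2>&1 | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)
#   [ "$(java_major "$v")" -ge 8 ] || echo "Java 1.8 or later is required" >&2
```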

Extract RNA-seq Evidence (ERE)

For post-processing the mapped RNA-seq data, we provide the tool ExtractRNAseqEvidence (ERE). You can run ERE from the command line with
java -jar GeMoMa-1.5.jar CLI ERE [<parameter>=<value> ...]
The parameters comprise:

name | comment | type
s | Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair, or the only read in case of single-end data, is assumed to be located on the forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq, you should use FR_FIRST_STRAND. range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED) | 
 | proteins (whether the complete protein sequences should be returned as output, default = false) | BOOLEAN
c | cds (whether the complete CDSs should be returned as output, default = false) | BOOLEAN
r | repair (if a transcript annotation cannot be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false) | BOOLEAN
s | selected (The path to a list file that restricts the predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns will be ignored. OPTIONAL) | FILE
Ambiguity | Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities. range={EXCEPTION, AMBIGUOUS, RANDOM}, default = EXCEPTION) | STRING
sefc | stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false) | BOOLEAN
f | full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true) | 
 | predictions (The (maximal) number of predictions per transcript, default = 10) | INT
selected | selected (The path to a list file that restricts the predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns can be used to determine a target region that should be overlapped by the prediction, if columns 2 to 5 contain chromosome, strand, start and end of the region. OPTIONAL) | FILE
as | avoid stop (A flag which allows to avoid stop codons in a transcript (except the last amino acid), default = true) | BOOLEAN
approx | approx (whether an approximation is used to compute the score for intron gain, default = true) | BOOLEAN
align | align (A flag which allows to output a tab-delimited file, which contains the results in a blast-like format (deprecated), default = false) | BOOLEAN
genomic | genomic (A flag which allows to output a fasta file containing the genomic regions of the predictions, default = false) | BOOLEAN
prefix | prefix (A prefix to be used for naming the predictions, default = empty) | STRING
tag | tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers. default = prediction) | 
 | timeout (The (maximal) number of seconds to be used for the predictions of one transcript; if exceeded, GeMoMa does not output a prediction for this transcript. valid range = [0, 604800], default = 3600) | LONG
outdir | The output directory, defaults to the current working directory (.) | STRING

GeMoMa returns the predicted annotation as gff file and the predicted proteins as fasta file.

GeMoMa Annotation Filter (GAF)

name | comment | type
t | tag (the tag used to read the GeMoMa annotations, default = prediction) | 
 | missing intron evidence filter (the filter for single-exon transcripts or if no RNA-seq data is used; decides for predictions overlapping other transcripts whether they should be used (=true) or discarded (=false), default = false) | 
 | evidence percentage filter (Each gene annotation file is handled as independent evidence. A prediction is only returned if it is contained in at least this percentage of evidence files. valid range = [0.0, 1.0], default = 0.5) | DOUBLE
outdir | The output directory, defaults to the current working directory (.) | STRING
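The evidence percentage filter can be illustrated with a small calculation. This is only a sketch of one reading of the description above (a prediction must occur in at least the given fraction of evidence files); it is not taken from the GAF implementation, and the helper name min_files is hypothetical.

```shell
# min_files N E: smallest number of evidence files (out of N) whose share is at least E.
min_files() {
    awk -v n="$1" -v e="$2" 'BEGIN {
        k = int(n * e)
        if (k < n * e) k++      # round up to the next whole file
        print k
    }'
}

min_files 3 0.5   # prints 2
min_files 4 0.5   # prints 2
```

Under this reading, the default of 0.5 with three reference organisms keeps a prediction only if at least two of the three annotation files contain it.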

CompareTranscripts

For comparing gene models from GeMoMa predictions with an existing annotation, we provide the tool CompareTranscripts. You can run CompareTranscripts from the command line with
java -jar GeMoMa-1.5.jar CLI CompareTranscripts [<parameter>=<value> ...]
The parameters comprise:

name | comment | type
p | prediction (The predicted annotation) | FILE
a | annotation (The true annotation) | FILE
assignment | assignment (the transcript info for the reference of the prediction, OPTIONAL) | FILE
prefix | prefix (whether the prefix should be deleted, default = false) | BOOLEAN
outdir | The output directory, defaults to the current working directory (.) | STRING
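A typical call using the parameters listed above might look as follows. This is a sketch only: the file names and the output directory compare_out are hypothetical placeholders, and the function prints the command rather than running it so that it can be inspected first.

```shell
# Hypothetical file names; replace them with your own paths (without spaces).
GEMOMA_JAR="GeMoMa-1.5.jar"
PREDICTION="predicted_annotation.gff"
ANNOTATION="existing_annotation.gff"

# compare_cmd: print the CompareTranscripts call for the files above.
compare_cmd() {
    echo "java -jar $GEMOMA_JAR CLI CompareTranscripts p=$PREDICTION a=$ANNOTATION outdir=compare_out"
}

compare_cmd
# To execute the printed command:  eval "$(compare_cmd)"
```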

AnnotationEvidence

For providing RNA-seq evidence (e.g., tie) for an existing annotation, we provide the tool AnnotationEvidence. You can run AnnotationEvidence from the command line with
java -jar GeMoMa-1.5.jar CLI AnnotationEvidence [<parameter>=<value> ...]
The parameters comprise:

name | comment | type
a | annotation (The genome annotation file (GFF)) | FILE
g | genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code) | 

percentage of predicted introns per predicted transcript with RNA-seq evidence

minSplitReads | minimal split reads | GeMoMa | prediction | minimal number of split reads for any of the predicted introns per predicted transcript
tpc | transcript percentage coverage | GeMoMa | prediction | percentage of covered bases per predicted transcript given RNA-seq evidence
minCov | minimal coverage | GeMoMa | prediction | minimal coverage of any base of the prediction given RNA-seq evidence
avgCov | average coverage | GeMoMa | prediction | average coverage of all bases of the prediction given RNA-seq evidence
score | GeMoMa score | GeMoMa | prediction | the score computed by GeMoMa using the substitution matrix, gap costs and additional penalties
iAA | identical amino acid | GeMoMa | prediction | percentage of identical amino acids between reference transcript and prediction
pAA | positive amino acid | GeMoMa | prediction | percentage of aligned positions between reference transcript and prediction yielding a positive score in the substitution matrix
evidence |  | GAF | prediction | the number of reference organisms that have a transcript yielding this prediction
alternative |  | GAF | prediction | alternative gene ID(s) leading to the same prediction
maxTie | maximal tie | GAF | gene | maximal tie of all transcripts of this gene
maxEvidence | maximal evidence | GAF | gene | maximal evidence of all transcripts of this gene

FAQs

Why does the Extractor not return a single CDS-part, protein, ...?

First, please check whether the names of your contigs/chromosomes in your annotation (gff) and genome file (fasta) are identical. Ideally, the fasta comments should contain only the contig/chromosome name. (Since GeMoMa 1.4, comments that contain the contig/chromosome name and some additional information separated by a space are also fine.) Second, please check whether you have a valid GFF/GTF file. Valid GFF files should have a valid "ID" or "Parent" entry in the attributes column. Valid GTF files should have a valid "gene_id" and "transcript_id" entry. Finally, please check the statistics that are given by the Extractor. It lists how many genes have been read and how many genes have been removed for different reasons. One common problem is that some annotation files do not include the stop codon in the CDS annotation.
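The first check, matching sequence names between annotation and genome, can be automated. The following is a minimal sketch, not part of GeMoMa: the function name missing_contigs is hypothetical, and it assumes a tab-delimited GFF/GTF with nine columns and FASTA headers whose name ends at the first whitespace.

```shell
# missing_contigs GFF FASTA: print sequence names that occur in column 1 of the
# annotation but not as a FASTA header (text after '>' up to the first whitespace).
missing_contigs() {
    gff=$1
    fasta=$2
    awk -F '\t' '
        NR == FNR {                       # first file: FASTA
            if (substr($0, 1, 1) == ">") {
                name = substr($0, 2)
                sub(/[ \t].*/, "", name)  # drop any comment after the name
                seen[name] = 1
            }
            next
        }
        /^#/ || NF < 9 { next }           # second file: skip comments and malformed lines
        !($1 in seen) && !reported[$1]++ { print $1 }
    ' "$fasta" "$gff"
}

# Usage: missing_contigs annotation.gff genome.fasta
```

Empty output means that every sequence name used in the annotation also exists in the genome file; any name that is printed points to a mismatch worth fixing before running the Extractor.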

How can I force GeMoMa to make more predictions?

There are several parameters affecting the number of predictions. The most prominent are the number of predictions (p) and the contig threshold (ct). For each reference transcript/CDS, GeMoMa initially makes a preliminary prediction and uses this prediction to determine whether a contig is promising and should be used to determine the final predictions. You may decrease ct and increase p to have more contigs in the final prediction. Increasing the number of predictions lets GeMoMa output more of the predictions that have been computed. Decreasing the contig threshold increases the number of predictions that are (internally) computed. Increasing p to a very large number without decreasing ct does not help.

Running GeMoMa on a single contig of my assembly yields thousands of weird predictions. What went wrong?

By default, GeMoMa is not built to be run on a single contig. GeMoMa tries to make predictions for all given reference CDS in the given target sequence(s). If the given target sequence is only a fraction of the complete target genome/assembly, GeMoMa will produce weird predictions as it does not filter for the quality of the predictions internally. There are two options to handle this:

Use a list of gene models that you expect to be located on this contig (cf. parameter "selected").

RNA-seq evidence is not required: GeMoMa is able to make predictions with and without RNA-seq evidence.

Is it possible to use multiple reference organisms?

It is possible to use multiple reference organisms for GeMoMa. Just run GeMoMa on each reference organism separately. Finally, you can employ GAF (cf. java -jar GeMoMa-<version>.jar CLI GAF) to combine these annotations.

Why do some reference genes not lead to a prediction in the target genome?

Please first check whether your reference genes have been discarded by the Extractor (cf. assignment file).

If the genes have been discarded, there are two possibilities:

The CDS might be redundant, i.e. the coding exons are identical to those of another transcript. In this case, only one CDS is further evaluated.

There might be something wrong with your reference genes, e.g., missing start codon, missing stop codon, premature stop codon, ambiguous nucleotides, ... and you should check the options of Extractor or the annotation.

If the reference genes passed the Extractor, there are several possible explanations for this behavior. The most prominent are:

GeMoMa stopped the prediction for a reference gene because it did not return a result within the given time (cf. parameter "timeout").

GeMoMa simply did not find a prediction matching the remaining quality criteria.

GeMoMa did find a prediction, but it was filtered out by GAF, e.g., due to a too low relative score or a missing start or stop codon (cf. GAF parameters).

What does "partial gene model" mean in the context of GeMoMa?

We call a gene model partial if it does not contain both an initial start codon and a final stop codon. However, this does not mean that the gene model is located at or close to the border of a chromosome or contig.

For two different reference transcripts, the predictions of GeMoMa overlap or are identical. What should I do with those?

GeMoMa makes the predictions for each reference transcript independently. Hence, some predictions of different reference transcripts can overlap or be identical, especially in gene families. Typically, you might like to filter or rank these predictions. We have implemented GAF (cf. java -jar GeMoMa-<version>.jar CLI GAF) to do this automatically. However, you can also do it by hand using the GFF attributes. Using RNA-seq data in GeMoMa yields additional fields in the annotation that can be used, e.g., average coverage (avgCov).
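Filtering by hand on a GFF attribute can be done with standard tools. The following is a minimal sketch, not part of GeMoMa: the function name filter_by_avgcov and the threshold are hypothetical, and it assumes key=value attributes separated by ';' in column 9 and the default tag "prediction" in column 3.

```shell
# filter_by_avgcov GFF MIN: print prediction features whose avgCov attribute is >= MIN.
# Assumes key=value attributes separated by ';' in column 9 and the default
# tag "prediction" in column 3.
filter_by_avgcov() {
    awk -F '\t' -v min="$2" '
        $3 == "prediction" {
            n = split($9, attr, ";")
            for (i = 1; i <= n; i++) {
                if (index(attr[i], "avgCov=") == 1) {
                    value = substr(attr[i], 8) + 0   # text after "avgCov="
                    if (value >= min + 0) print
                }
            }
        }
    ' "$1"
}

# Usage: filter_by_avgcov predicted_annotation.gff 5
```

The sketch prints only the transcript-level lines; the corresponding CDS lines would have to be selected afterwards via their Parent attribute. The same pattern works for other attributes such as tie or iAA by changing the key.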

A lot of transcripts have been filtered out by the Extractor. What can I do?

There are several reasons for removing transcripts by the Extractor. At least in two cases you can try to get more transcripts by setting specific parameter values. First, if the transcript contains ambiguous nucleotides, please test the parameter "Ambiguity". Second, sometimes we received GFFs which contain wrong phases for CDS entries (e.g., 0 for all CDS entries in the phase column of the GFF). Since version 1.3.2, we provide the option "r" which stands for repair. If r=true is chosen, the Extractor tries to infer all phases for transcripts that show an error and would be filtered out.

Is GeMoMa able to predict pseudo-genes/ncRNA?

No, currently not.

My RNA-seq data indicates there is an additional intron in a transcript, but GeMoMa does not predict this. Or vice versa, GeMoMa predicts an intron that is not supported by RNA-seq data. What's the reason?

GeMoMa is mainly based on the assumptions of amino acid and intron position conservation between reference and target species. Hence, GeMoMa tries to predict a gene model with similar exon-intron structure in the target species and does not stick too much to RNA-seq data. Although intron position conservation can be observed in most cases, sometimes new introns evolve or others vanish. For this reason, GeMoMa also allows for the inclusion or exclusion of introns at some additional cost (cf. GeMoMa parameter intron-loss-gain-penalty). However, the behaviour of GeMoMa depends on the parameter settings (especially intron-loss-gain-penalty, sm (substitution matrix), go (gap opening), ge (gap extension)) and the length of the missed/additional intron. Nevertheless, such cases can only occur if the additional/missed intron has a length that is divisible by 3, which preserves the reading frame.

Since the available RNA-seq data only reflects a fraction of tissues/environmental conditions/..., missing RNA-seq evidence does not necessarily mean that the prediction is wrong.
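The reading-frame condition mentioned above is a simple divisibility check. As a small illustration (the helper name preserves_frame and the example lengths are hypothetical):

```shell
# preserves_frame LENGTH: succeed if an intron of LENGTH bp is a multiple of 3,
# i.e., including or excluding it leaves the downstream reading frame unchanged.
preserves_frame() {
    [ $(( $1 % 3 )) -eq 0 ]
}

preserves_frame 87 && echo "87 bp: frame preserved"
preserves_frame 88 || echo "88 bp: frame shifted"
```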

My RNA-seq data indicates two alternative, highly overlapping introns. Interestingly, GeMoMa does not take the intron that is more abundant. Why?

GeMoMa reads the introns from the input file using some filter (cf. GeMoMa parameter r (reads)). All introns that pass the filter are used and treated equally. Hence, GeMoMa uses the intron that matches the expectation of intron position and amino acid conservation compared to the reference transcript.

Does GeMoMa predict multiple transcripts per gene?

In principle, GeMoMa can predict multiple transcripts per gene if corresponding transcripts are given in the reference species or if multiple reference species are used.

GeMoMa failed with java.lang.OutOfMemoryError. What can I do?

Whenever you see a java.lang.OutOfMemoryError, you should rerun the program with Java virtual machine (VM) options. More specifically, you should set -Xms, the initially used RAM, e.g., to 5 GB (-Xms5G), and -Xmx, the maximally used RAM, e.g., to 50 GB (-Xmx50G). GeMoMa often needs more memory if you have a large genome and if you're providing a large coverage file (extracted from RNA-seq data). If you don't have a compute node with enough memory, you can run GeMoMa without coverage, which will return the same predictions but will not include all statistics. Another cause could be the protein alignment, if you use the optional parameter query protein. Again, you can run GeMoMa without this parameter, which will return the same predictions but fewer statistics.
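The VM options have to be placed before -jar, otherwise they are passed to GeMoMa instead of the Java VM. The following sketch prints such a call; the jar name, the heap sizes and the output directory gemoma_out are hypothetical placeholders, and the remaining GeMoMa input parameters are omitted.

```shell
# Hypothetical jar name; adjust it to your installation.
GEMOMA_JAR="GeMoMa-1.5.jar"

# gemoma_cmd XMS XMX ARGS...: print a java call with initial (Xms) and
# maximal (Xmx) heap size, followed by the tool name and its parameters.
gemoma_cmd() {
    xms=$1
    xmx=$2
    shift 2
    echo "java -Xms$xms -Xmx$xmx -jar $GEMOMA_JAR CLI $*"
}

gemoma_cmd 5G 50G GeMoMa outdir=gemoma_out
```

Choose a maximal heap size (-Xmx) below the physical memory of the compute node so that the VM does not start swapping.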