Incorporating RNA-Seq into AUGUSTUS

This document describes a method for
structurally annotating a genome based on massive cDNA sequencing
(RNA-Seq). The guide below is a description of the method we
developed for and applied in the RNA-Seq based Genome Annotation
Assessment Project (rGASP).
It was applied to a variety of data types, such as from Illumina and
SOLiD, for read lengths between 33bp and 75bp, single and paired-end
reads, stranded and unstranded reads, on human, Drosophila and
C.elegans. We therefore expect this to be generally applicable to
other projects with some parameter adjustments. The general approach
is to generate hints for Augustus from the RNA-Seq, which can be used together
with hints from other sources if available (like from an existing
gene models, ESTs, protein or genomic conservation, MS/MS). The
approach outlined below is preliminary, and certainly can be improved
upon, e.g. by adding more information from the RNA-Seq in the form of
hints. Also, it is just one way to translate RNA-Seq information to
gene structure hints and you can adjust it for your processed data if you have
already a mapping or assembly of reads.
For example, if you choose to first
assemble RNA-Seq reads to contigs, you may just just treat them as
other larger transcript fragments such as ESTs and use
BLAT/GMAP+blat2hints.pl to generate the hints.

1.
Align RNA-Seq Reads to Genome

For species with many multi-exon genes
(most species), we found that spliced reads are by far most
valuable. Spliced reads are reads that span an exon-exon boundary
and therefore delineate a complete intron. It is therefore important
to chose an alignment method that recovers as many true spliced reads
as possible.

1.1 Illumina:

In lack of a better program, we chose
to use the program BLAT, which is not customized for NGS reads (Next
Generation Sequencing), but found more introns than Tophat 1.0.11. However,
it is straightforward to adjust this pipeline to Bowtie/Tophat input,
because Tophat produces results in the same or similar format (wig
file and intron list including multiplicities).

Then filter alignments for alignment
quality, uniqueness and pairedness, if applicable.

cat
ali.psl | filterPSL.pl --uniq --paired > ali.f.psl

This filter requires by default a
minimum coverage of 80% of the read length and a minimum percent
identity of 92%.

--uniq
makes the filter script keep only alignments, which are 1) the best
alignment for that read and 2) unique in the sense that all other
alignments of that read have a score that is at most 0.96
(uniquethresh) the score of the best. The rationale behind stringent
filtering is that self-similarity of the genome or of transcripts in
particular may otherwise yield many false positive alignments. False
positive alignments can be a greater problem to Augustus as false
negative alignments, the later may frequently be corrected by the ab
initio model aided by neighboring alignments.

--paired
is another filter that only applies to paired-end reads. It requires
that two reads from the same read-pair (when the read names end in
.f,.r or /1,/2) are on opposite strands, on the same genomic contig
and at most maxintronlen (default: 500000) apart from each other.

For single-end reads just run

cat
ali.psl | filterPSL.pl --uniq > ali.f.psl

Repeatmasking alone usually does not filter out
repetitions of DNA to the extend that RNA-Seq read mapping become
unambiguous or false positive. From our experience it can happen that even with the stringent
--unique filtering reads map to a likely wrong place in the genome. This is
frequently a region that is not a gene but shows high genomic similarity to
another region with a gene. You may therefore consider increasing the
stringency of the filtering by removing hits from regions that are repeated
similarly in the genome (and are not repeatmasked nonetheless). We chose to
remove hints that overlap a self-chain alignment of the UCSC Genome Browser.

Sort alignments by target sequence and
within target sequence by position for further processing

cat
ali.f.psl | sort -n -k 16,16 | sort -s -k 14,14 > ali.fs.psl

1.2 SOLiD:

Color-space reads cannot be aligned
with BLAT. Therefore we chose SHRIMP, e.g. in the case of 36bp reads
we ran

We currently generate two types of
hints from the RNA-Seq alignments, exonpart and intron
hints. Intron hints are more important but can currently not be
generated by all NGS mapping programs: Most of them perform
no spliced alignment.

1.2 exonpart hints and wiggle
track

Each exonpart hint specifies an
interval that is likely to be exonic. We take the local coverage
depth into account. For the purpose of exonpart hint generation and
visualization we first summarize the alignments to a coverage depth
function in wiggle format. Wiggle format (.wig) is a format of the
UCSC Genome Browser.

aln2wig
-f ali.fs.psl > cov.wig

This command works both for Illumina
and SOLiD reads. For SOLiD you replace ali.fs.psl
with shrimp.fs.out. If you
have stranded reads, you should generate two wig tracks, e.g.
plus.wig and minus.wig
and treat them separately below. The corresponding hints
should then have the strand column set. This will help Augustus to
determine the strand of a gene fragment.

or adjust this command line accordingly
when you have a stranded wig file. This command generates for each
coverage island a sequence of adjacent 20bp exonpart hints
(exonpart = ep) and also stores the average coverage depth in each
such window as a means to adjust the weight of the evidence later.

1.2 intron hints

Evidence about introns is collected
from spliced alignments. The crucial data is the exact intron
boundaries and the multiplicity of the intron, i.e. the absolute
number of reads that span the intron.

blat2hints.pl
--intronsonly --in=ali.fs.psl --out=hints.introns.gff

If you chose a different approach,
produce an output format like this for the intron hints:

chr1 b2h intron
14800 14877 0 . . mult=47;src=E

In this tab-separated format the
14800th base of sequence chr1 is
the first base of the intron, 14877 is the last, 47 is the
multiplicity of the intron and E is the source identifier for
setting the bonus of this evidence source below. The strand of an
intron will be determined automatically by Augustus.

1.3 joined hints

Concatenate all hints:

cat
hints.introns.gff hints.ep.gff
[hints.other.gff] > hints.gff

The
optional hints.other.gff may contain hints from other evidence
sources. It may be preferable for efficiency to split the hints file
by chromosome.

3.
Run Augustus

3.1 Set hint parameters

Adjust the file extrinsic.cfg that
holds the hint parameters. Start by copying the file
config/extrinsic/extrinsic.M.RM.E.W.cfg. In the example above the
relevant, non-neutral lines could look like this:

The exonpart malus of .992 means
a weak penalty factor for every predicted exonic base that is not
supported by any exonpart hints. The exonpart bonus for hints of
source W of 1.005 mean that gene structures get this bonus factor for
every exonpart hint of multiplicity 1 that is completely included in an
exon. Introns that are not supported by any intron hint are
penalized by .34, and introns that are supported by RNA-Seq hints are
rewarded by a factor of 100,000. The 0.985 are local
malus factors. The concept of a local malus was recently introduced to
account for different levels of missing information, that becomes
more important with with higher general coverage of RNA-Seq. Like the (normal)
malus, the local malus also applies to exonic bases that are unsupported by
hints. In contract to the normal malus, the local
malus only applies to exons, that are well-supported at some region and not
supported at another region of the same exon. The following picture
illustrates a typical case where the local malus applies.

This UTR exon candidate is very unevenly supported by RNA-Seq coverage and therefore likely to be too
long. For every base that is not covered by any exonpart hint a local UTRpart malus
factor of 0.985 is applied in addition to the normal exonpart malus
of 0.992.

The nonexonpart bonus of the RM source in this case is for
incorporating hints from repeat masking (src=RM). Augustus is run on
unmasked sequence with the repeat mask intervals given as nonexonpart
hints. In this regions, introns and intergenic regions are rewarded by
1.01 for each base.

Above parameters have to be seen as an
example. The optimal values will depend on various parameters,
including

1) Amount of RNA-Seq and percentage of
genes expressed in the library. The more transcripts are expressed
and covered by RNA-Seq the stronger the malus can be. In the extreme
(hypothetical) case, all exonic bases are covered with RNA-Seq and
unsupported predicted exonic regions can be punished hard.

2) The stringency of filtering. If, for
some other reason, you have chosen to include more than one
hit of a read or you allow best alignments also when they are
ambiguous, then you may get a significant false positive rate of
hints and you may have to reduce bonus factors significantly.

We recommend switching the UTR flag to
"on" because RNA-Seq covers UTR as well. With --UTR=off,
the exonpart hints would be (mis)interpreted to be hints for coding
parts of exons. --allow_hinted_splicesites=atac
allows Augustus to predict the (rare) introns that start with AT and
end with AC in addition to the GT-AG and GC-AG introns that are
allowed by default. --alternatives-from-evidence=true
turns on the prediction of alternative splicing. For this, intron
hints are in particular informative, as exonpart hints alone
will (if unstranded) not yield alternative transcripts.