MLOGD: Programme for detecting
overlapping coding sequences

Summary: This is a suite of software for detecting new
protein-coding sequences (CDSs) by analysing the pattern of mutations
across an input sequence alignment. In particular, the software can
be used to detect new CDSs that overlap known CDSs in a different
reading frame. Such CDSs can be difficult to detect with standard
gene-finding algorithms (see below). The software is particularly
useful for analysing virus genome alignments, where overlapping genes
and ribosomal frameshifting sites are common.

You can enter your sequences into the web interface (recommended) or
download the programmes (C++ code and csh scripts, with user
instructions for Linux) to run locally.

Please use the following login details if requested. Note that these
will only allow access to public parts of this site. If you get an
'access denied' error then you are probably trying to access a
non-public part. Please contact me (aef24@cam.ac.uk).

MLOGD is not a 'universal' gene-finder. A CDS has to be subject to
purifying selection to be detected (e.g. the HCV F ORF is not
detected). Overlapping CDSs that are less conserved (at the amino acid
level) than the genes they overlap are often missed, so a negative
MLOGD signal does not mean that an ORF is non-coding.

-2 frame overlaps generate false positives.

Scores show some scatter into low positive values, so thresholds need
to be set to avoid false positives. (I like to discard those where the
'mean log likelihood ratio per nucleotide' is less than one sixth of
the 'sequence divergence', i.e. where y < x/6 in the statement 'Sum
over phylogenetic tree is located at (x,y)' at the bottom of the
'likelihood ratio plot' in the 'Test input query CDSs' results page.)
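
The rule of thumb above can be written as a small filter. This is only
a sketch in Python, not part of MLOGD: the function name, the
dictionary of candidates and the (x, y) values are all hypothetical,
and the numbers would have to be copied by hand from the 'Sum over
phylogenetic tree is located at (x,y)' statement on the results page.

```python
def passes_threshold(divergence, mean_llr_per_nt, ratio=1.0 / 6.0):
    """Return True if a point (x, y) survives the y >= x/6 rule of thumb.

    divergence       -- x, the 'sequence divergence'
    mean_llr_per_nt  -- y, the 'mean log likelihood ratio per nucleotide'
    ratio            -- threshold slope; 1/6 is the value suggested above
    """
    return mean_llr_per_nt >= divergence * ratio


# Hypothetical (x, y) values, made up for illustration only.
candidates = {
    "orfA": (0.60, 0.15),
    "orfB": (0.60, 0.05),
    "orfC": (1.20, 0.10),
}

# Keep only the candidates that are not discarded by the rule.
kept = [name for name, (x, y) in candidates.items() if passes_threshold(x, y)]
print(kept)
```

The slope is left as a parameter because one sixth is a personal
preference rather than a fixed property of the method.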

If an ORF is very short, MLOGD may give a false positive if certain
alignment columns are constrained (e.g. by RNA secondary structure or
a regulatory region).

Overview:

Overlapping protein-coding sequences (CDSs) are particularly common in
viruses but also occur in more complex genomes. Detecting such genes
with conventional gene-finding algorithms can be difficult for several
reasons. Due to the double-coding constraints, overlapping CDSs often
display an atypical codon bias. Extending training-set methods, such
as HMMs, to overlapping CDSs is made difficult by the several
different frames (each requiring its own model) and limited training
data. Similarity to known sequences or conservation between species
may only point to the existence of one of an overlapping pair.
Furthermore, overlapping genes on the same strand (e.g. at ribosomal
frameshifting sites) may share a promoter and mRNA, so that looking
for promoters or transcription may identify only one of the two
genes. Nonetheless, overlapping CDSs have their own signatures
resulting from the mutational constraints imposed by the requirement
of simultaneously maintaining protein function in both genes.

The original MLOGD was a suite of software for analysing the mutation
patterns in a multiple sequence alignment and estimating the relative
likelihood that a given sequence region is single-coding or
double-coding. The mutation model includes a nucleotide mutation
matrix, codon usage table and amino acid substitution matrix. The
suite also included a Monte Carlo single/double-coding sequence
evolution simulator, for determining confidence scores and other
statistics (as a function of sequence composition, length, divergence
time and the double-coding frame).

The current version of MLOGD has been improved in several respects,
and is now much more user-friendly than the original version. There
are three running modes. In the 'Test input query CDSs' mode, the
user inputs an alignment, annotation of known CDSs in a reference
sequence, and the position of a query, or hypothetical, CDS. MLOGD
then calculates the likelihood ratio between the null model (only the
known CDSs are coding) and the alternate model (both the known CDSs
and the query CDS are coding). This may involve combinations of the
non-coding, single-coding and double-coding mutation models. In the
'Find and test all non-annotated ORFs' mode, MLOGD will look for all
non-annotated ORFs above a given length in the reference sequence, and
calculate the above likelihood ratio for each ORF. In the 'Six-frame
sliding window plots' mode, MLOGD will calculate the likelihood
ratio in sliding windows in all six possible reading frames. Positive
regions in the plots may indicate unannotated CDSs.
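
To illustrate the first step of the 'Find and test all non-annotated
ORFs' mode, the sketch below scans a reference sequence in all six
reading frames for open stretches above a given length. It is written
in Python and is not MLOGD code. It assumes the standard genetic code
and treats an ORF as a stop-codon-free run of codons (no start codon
required); how MLOGD itself defines an ORF is not stated here. The
function names are hypothetical, and the sketch neither excludes
annotated CDSs nor computes any likelihood ratio.

```python
STOPS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def open_stretches(seq, min_len):
    """Yield (frame, start, end) for stop-free codon runs of >= min_len nt.

    Coordinates are 0-based and end-exclusive on the strand passed in;
    frame is 0, 1 or 2. Stop codons themselves are not included.
    """
    seq = seq.upper()
    min_len = max(min_len, 3)  # ignore empty runs
    for frame in range(3):
        run_start = frame
        # Visit every complete codon in this frame.
        for pos in range(frame, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOPS:
                if pos - run_start >= min_len:
                    yield frame, run_start, pos
                run_start = pos + 3
        # Close the final run at the end of the last complete codon.
        end = frame + 3 * ((len(seq) - frame) // 3)
        if end - run_start >= min_len:
            yield frame, run_start, end


def six_frame_orfs(seq, min_len):
    """Collect stop-free stretches on both strands as (strand, frame, start, end)."""
    hits = [("+", f, s, e) for f, s, e in open_stretches(seq, min_len)]
    rc = reverse_complement(seq.upper())
    hits += [("-", f, s, e) for f, s, e in open_stretches(rc, min_len)]
    return hits


# Toy sequence, made up for illustration only.
for hit in six_frame_orfs("ATGAAATAGCCC", 6):
    print(hit)
```

Coordinates for '-' strand hits refer to the reverse-complemented
sequence, so they would need converting before being compared with an
annotation on the forward strand.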