Function

Remove poly-A tails from nucleotide sequences

Description

trimest reads one or more nucleotide sequences and writes them out again with any 3' poly-A tail (or, optionally, 5' poly-T tail) removed. It detects any poly-A and poly-T tails in the input sequences that are at least the specified minimum length. The tails may contain a defined number of non-A or non-T bases. If both a 5' poly-T tail and a 3' poly-A tail are identified, it removes the longer of the two. The output is a set of sequences with the poly-A (or poly-T) tails removed. If a sequence had a 5' poly-T tail then the resulting sequence is reverse-complemented by default. The description line has a comment appended about the changes made to the sequence.

Algorithm

trimest looks for a run of at least -minlength A's at the 3' end (and, by default, -minlength T's at the 5' end). If there is both an apparent 5' poly-T tail and a 3' poly-A tail, it removes whichever is the longer of the two.

By default, it will allow -mismatches non-A (or non-T) bases in the tail. If a mismatch is found, then there has to be at least -minlength A's (or T's) past the mismatch (working from the end) for the mismatch to be considered part of the tail. If -mismatches is greater than 1 then that number of contiguous non-A (or non-T) bases will be allowed as part of the tail.
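
The scan described above can be sketched in Python. This is a minimal illustration written from the prose in this section, not the EMBOSS source code. The function name tail_length and its from_end parameter are hypothetical, and details such as the treatment of ambiguity codes are not modelled.

```python
def tail_length(seq, base="A", minlength=4, mismatches=1, from_end=True):
    """Return how many terminal bases look like a poly-`base` tail.

    Assumption: this follows the description in the text, where a
    mismatch only joins the tail if at least `minlength` matching
    bases lie beyond it (working inwards from the end).
    """
    s = seq.upper()
    if from_end:
        s = s[::-1]      # scan the 3' end by walking a reversed copy
    length = 0           # size of the tail accepted so far
    run = 0              # current run of matching bases
    bad = 0              # current run of contiguous mismatches
    for i, c in enumerate(s):
        if c == base:
            run += 1
            bad = 0
            if run >= minlength:
                length = i + 1   # everything up to here is tail
        else:
            run = 0
            bad += 1
            if bad > mismatches:
                break            # too many contiguous mismatches
    return length


print(tail_length("GCAGAAAA"))                              # 4
print(tail_length("GCAAAAGAAAA"))                           # 9
print(tail_length("TTTTTGCA", base="T", from_end=False))    # 5
```

With the default values, only the terminal four A's of GCAGAAAA are counted, because fewer than four A's lie beyond the last G. In GCAAAAGAAAA the G is followed (working inwards) by four more A's, so nine bases are counted.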

Usage

Command line arguments

Remove poly-A tails from nucleotide sequences
Version: EMBOSS:6.5.0.0
Standard (Mandatory) qualifiers:
[-sequence] seqall Nucleotide sequence(s) filename and optional
format, or reference (input USA)
[-outseq] seqoutall [.] Sequence set(s)
filename and optional format (output USA)
Additional (Optional) qualifiers:
-minlength integer [4] This is the minimum length that a poly-A
(or poly-T) tail must have before it is
removed. If there are mismatches in the tail
than there must be at least this length of
poly-A tail before the mismatch for the
mismatch to be considered part of the tail.
(Integer 1 or more)
-mismatches integer [1] If there are this number or fewer
contiguous non-A bases in a poly-A tail
then, if there are '-minlength' 'A' bases
before them, they will be considered part of
the tail and removed .
For example the terminal 4 A's of GCAGAAAA
would be removed with the default values of
-minlength=4 and -mismatches=1 (There are
not at least 4 A's before the last 'G' and
so only the A's after it are considered to
be part of the tail). The terminal 9 bases
of GCAAAAGAAAA would be removed; There are
at least -minlength A's preceeding the last
'G', so it is part of the tail. (Integer 0
or more)
-[no]reverse boolean [Y] When a poly-T region at the 5' end of
the sequence is found and removed, it is
likely that the sequence is in the reverse
sense. This option will change the sequence
to the forward sense when it is written out.
If this option is not set, then the sense
will not be changed.
-tolower toggle [N] The poly-A region can be 'masked' by
converting the sequence characters to
lower-case. Some non-EMBOSS programs e.g.
fasta can interpret this as a masked region.
The sequence is unchanged apart from the
case change. You might like to ensure that
the whole sequence is in upper-case before
masking the specified regions to lower-case
by using the '-supper' sequence qualifier.
Advanced (Unprompted) qualifiers:
-[no]fiveprime boolean [Y] If this is set true, then the 5' end of
the sequence is inspected for poly-T tails.
These will be removed if they are longer
than any 3' poly-A tails. If this is false,
then the 5' end is ignored.
Associated qualifiers:
"-sequence" associated qualifiers
-sbegin1 integer Start of each sequence to be used
-send1 integer End of each sequence to be used
-sreverse1 boolean Reverse (if DNA)
-sask1 boolean Ask for begin/end/reverse
-snucleotide1 boolean Sequence is nucleotide
-sprotein1 boolean Sequence is protein
-slower1 boolean Make lower case
-supper1 boolean Make upper case
-scircular1 boolean Sequence is circular
-sformat1 string Input sequence format
-iquery1 string Input query fields or ID list
-ioffset1 integer Input start position offset
-sdbname1 string Database name
-sid1 string Entryname
-ufo1 string UFO features
-fformat1 string Features format
-fopenfile1 string Features file name
"-outseq" associated qualifiers
-osformat2 string Output seq format
-osextension2 string File name extension
-osname2 string Base file name
-osdirectory2 string Output directory
-osdbname2 string Database name to add
-ossingle2 boolean Separate file for each entry
-oufo2 string UFO features
-offormat2 string Features format
-ofname2 string Features file name
-ofdirectory2 string Output directory
General qualifiers:
-auto boolean Turn off prompts
-stdout boolean Write first file to standard output
-filter boolean Read first file from standard input, write
first file to standard output
-options boolean Prompt for standard and additional values
-debug boolean Write debug output to program.dbg
-verbose boolean Report some/full command line options
-help boolean Report command line options and exit. More
information on associated and general
qualifiers can be found with -help -verbose
-warning boolean Report warnings
-error boolean Report errors
-fatal boolean Report fatal errors
-die boolean Report dying program messages
-version boolean Report version number and exit
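
As an illustration of how the qualifiers listed above combine, the following Python sketch assembles a trimest command line as an argument list. The helper name build_trimest_command and the output file name are hypothetical. The qualifier names are taken from the help text above, and actually running the command assumes an EMBOSS installation that provides trimest.

```python
def build_trimest_command(sequence, outseq, minlength=4, mismatches=1,
                          reverse=True, tolower=False, fiveprime=True):
    """Build an argument list for trimest (hypothetical helper)."""
    cmd = [
        "trimest",
        "-sequence", sequence,
        "-outseq", outseq,
        "-minlength", str(minlength),
        "-mismatches", str(mismatches),
    ]
    if not reverse:
        cmd.append("-noreverse")     # keep the sense unchanged
    if tolower:
        cmd.append("-tolower")       # mask the tail instead of removing it
    if not fiveprime:
        cmd.append("-nofiveprime")   # ignore 5' poly-T tails
    cmd.append("-auto")              # turn off prompts
    return cmd


# "x65923.fasta" is an illustrative output name, not one from the documentation.
cmd = build_trimest_command("tembl:x65923", "x65923.fasta")
print(" ".join(cmd))
# To execute it (requires EMBOSS): subprocess.run(cmd, check=True)
```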

Additional (Optional) qualifiers

-minlength (integer): This is the minimum length that a poly-A (or poly-T) tail must have before it is removed. If there are mismatches in the tail then there must be at least this length of poly-A tail before the mismatch for the mismatch to be considered part of the tail. Allowed values: integer 1 or more. Default: 4

-mismatches (integer): If there are this number or fewer contiguous non-A bases in a poly-A tail then, if there are '-minlength' 'A' bases before them, they will be considered part of the tail and removed. For example, the terminal 4 A's of GCAGAAAA would be removed with the default values of -minlength=4 and -mismatches=1 (there are not at least 4 A's before the last 'G', so only the A's after it are considered to be part of the tail). The terminal 9 bases of GCAAAAGAAAA would be removed; there are at least -minlength A's preceding the last 'G', so it is part of the tail. Allowed values: integer 0 or more. Default: 1

-[no]reverse (boolean): When a poly-T region at the 5' end of the sequence is found and removed, it is likely that the sequence is in the reverse sense. This option will change the sequence to the forward sense when it is written out. If this option is not set, then the sense will not be changed. Allowed values: Boolean value Yes/No. Default: Yes

-tolower (toggle): The poly-A region can be 'masked' by converting the sequence characters to lower-case. Some non-EMBOSS programs, e.g. fasta, can interpret this as a masked region. The sequence is unchanged apart from the case change. You might like to ensure that the whole sequence is in upper-case before masking the specified regions to lower-case by using the '-supper' sequence qualifier. Allowed values: Toggle value Yes/No. Default: No
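
The masking behaviour described for -tolower can be pictured with a short Python sketch. It assumes the tail length has already been determined. The function name mask_polya_tail and the upper_first parameter (standing in for the effect of '-supper') are hypothetical.

```python
def mask_polya_tail(seq, tail_len, upper_first=False):
    """Lower-case the last `tail_len` bases instead of removing them."""
    if upper_first:
        seq = seq.upper()    # comparable to applying '-supper' first
    if tail_len <= 0:
        return seq           # nothing to mask
    return seq[:-tail_len] + seq[-tail_len:].lower()


print(mask_polya_tail("GCAAAAGAAAA", 9))   # GCaaaagaaaa
```

Only the case changes; the sequence length and base content stay the same.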

Advanced (Unprompted) qualifiers

-[no]fiveprime (boolean): If this is set true, then the 5' end of the sequence is inspected for poly-T tails. These will be removed if they are longer than any 3' poly-A tails. If this is false, then the 5' end is ignored. Allowed values: Boolean value Yes/No. Default: Yes
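
The choice between the two ends can be sketched as a small Python function. It assumes the two tail lengths have already been measured and resolves a tie in favour of the poly-A tail, which is one reading of "longer than any 3' poly-A tails". The function name choose_tail and its return values are hypothetical.

```python
def choose_tail(polya_len, polyt_len, fiveprime=True):
    """Decide which tail, if any, would be removed."""
    if not fiveprime:
        polyt_len = 0            # the 5' end is ignored entirely
    if polyt_len > polya_len:
        return "poly-T"          # strictly longer 5' poly-T tail wins
    if polya_len > 0:
        return "poly-A"
    return None                  # no tail found at either end


print(choose_tail(polya_len=4, polyt_len=9))                    # poly-T
print(choose_tail(polya_len=4, polyt_len=9, fiveprime=False))   # poly-A
```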

Associated qualifiers

"-sequence" associated seqall qualifiers

-sbegin1, -sbegin_sequence (integer): Start of each sequence to be used. Allowed values: any integer value. Default: 0
-send1, -send_sequence (integer): End of each sequence to be used. Allowed values: any integer value. Default: 0
-sreverse1, -sreverse_sequence (boolean): Reverse (if DNA). Allowed values: Boolean value Yes/No. Default: N
-sask1, -sask_sequence (boolean): Ask for begin/end/reverse. Allowed values: Boolean value Yes/No. Default: N
-snucleotide1, -snucleotide_sequence (boolean): Sequence is nucleotide. Allowed values: Boolean value Yes/No. Default: N
-sprotein1, -sprotein_sequence (boolean): Sequence is protein. Allowed values: Boolean value Yes/No. Default: N
-slower1, -slower_sequence (boolean): Make lower case. Allowed values: Boolean value Yes/No. Default: N
-supper1, -supper_sequence (boolean): Make upper case. Allowed values: Boolean value Yes/No. Default: N
-scircular1, -scircular_sequence (boolean): Sequence is circular. Allowed values: Boolean value Yes/No. Default: N
-sformat1, -sformat_sequence (string): Input sequence format. Allowed values: any string.
-iquery1, -iquery_sequence (string): Input query fields or ID list. Allowed values: any string.
-ioffset1, -ioffset_sequence (integer): Input start position offset. Allowed values: any integer value. Default: 0
-sdbname1, -sdbname_sequence (string): Database name. Allowed values: any string.
-sid1, -sid_sequence (string): Entryname. Allowed values: any string.
-ufo1, -ufo_sequence (string): UFO features. Allowed values: any string.
-fformat1, -fformat_sequence (string): Features format. Allowed values: any string.
-fopenfile1, -fopenfile_sequence (string): Features file name. Allowed values: any string.

"-outseq" associated seqoutall qualifiers

-osformat2, -osformat_outseq (string): Output seq format. Allowed values: any string.
-osextension2, -osextension_outseq (string): File name extension. Allowed values: any string.
-osname2, -osname_outseq (string): Base file name. Allowed values: any string.
-osdirectory2, -osdirectory_outseq (string): Output directory. Allowed values: any string.
-osdbname2, -osdbname_outseq (string): Database name to add. Allowed values: any string.
-ossingle2, -ossingle_outseq (boolean): Separate file for each entry. Allowed values: Boolean value Yes/No. Default: N
-oufo2, -oufo_outseq (string): UFO features. Allowed values: any string.
-offormat2, -offormat_outseq (string): Features format. Allowed values: any string.
-ofname2, -ofname_outseq (string): Features file name. Allowed values: any string.
-ofdirectory2, -ofdirectory_outseq (string): Output directory. Allowed values: any string.

General qualifiers

-auto (boolean): Turn off prompts. Allowed values: Boolean value Yes/No. Default: N
-stdout (boolean): Write first file to standard output. Allowed values: Boolean value Yes/No. Default: N
-filter (boolean): Read first file from standard input, write first file to standard output. Allowed values: Boolean value Yes/No. Default: N
-options (boolean): Prompt for standard and additional values. Allowed values: Boolean value Yes/No. Default: N
-debug (boolean): Write debug output to program.dbg. Allowed values: Boolean value Yes/No. Default: N
-verbose (boolean): Report some/full command line options. Allowed values: Boolean value Yes/No. Default: Y
-help (boolean): Report command line options and exit. More information on associated and general qualifiers can be found with -help -verbose. Allowed values: Boolean value Yes/No. Default: N
-warning (boolean): Report warnings. Allowed values: Boolean value Yes/No. Default: Y
-error (boolean): Report errors. Allowed values: Boolean value Yes/No. Default: Y
-fatal (boolean): Report fatal errors. Allowed values: Boolean value Yes/No. Default: Y
-die (boolean): Report dying program messages. Allowed values: Boolean value Yes/No. Default: Y
-version (boolean): Report version number and exit. Allowed values: Boolean value Yes/No. Default: N

Input file format

trimest reads one or more nucleotide sequences, each specified by a USA (Uniform Sequence Address).

Input files for usage example

'tembl:x65923' is a sequence entry in the example nucleic acid database 'tembl'

Output file format

If a poly-A tail is removed, then [poly-A tail removed] is appended to the description of the sequence. If a poly-T tail is removed, then [poly-T tail removed] is appended and, if the sequence is reversed, [reverse complement] is also appended.

The output is a set of sequences with the poly-A (or poly-T) tails
removed. If a sequence had a 5' poly-T tail then the resulting sequence
is reverse-complemented by default. The description line has a comment
appended about the changes made to the sequence.
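
The effect on one output record can be sketched in Python as follows. This is an illustration under stated assumptions, not the EMBOSS implementation: the function name apply_trim is hypothetical, the tail type and length are taken as already known, and joining the comments to the description with a single space is a guess at the exact layout.

```python
# Complement table for unambiguous bases; other characters pass through unchanged.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def apply_trim(description, seq, tail, length, reverse=True):
    """Remove a tail and append the matching comment(s) to the description."""
    notes = []
    if tail == "poly-A" and length > 0:
        seq = seq[:len(seq) - length]            # drop the 3' tail
        notes.append("[poly-A tail removed]")
    elif tail == "poly-T" and length > 0:
        seq = seq[length:]                       # drop the 5' tail
        notes.append("[poly-T tail removed]")
        if reverse:
            seq = seq.translate(_COMPLEMENT)[::-1]
            notes.append("[reverse complement]")
    return " ".join([description] + notes).strip(), seq


print(apply_trim("EST2", "GCAAAAGAAAA", "poly-A", 9))
# ('EST2 [poly-A tail removed]', 'GC')
print(apply_trim("EST1", "TTTTTGCAC", "poly-T", 5))
# ('EST1 [poly-T tail removed] [reverse complement]', 'GTGC')
```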

Data files

None.

Notes

EST and mRNA sequences often have poly-A tails at their 3' end.
Where an EST sequence is the reverse complement of a corresponding
mRNA's forward sense it may have a poly-T tail at its 5' end.

trimest is not infallible. There are often repeats
of A (or T) in a sequence that just happen by chance
to occur at the 3' (or 5') end of the EST sequence. trimest has
no way of determining if the A's it finds are part of a real poly-A
tail or are a part of the transcribed genomic sequence. It removes any
apparent poly-A tails that match its criteria for a poly-A tail (see
"Algorithm").