FINDREP(1) OGMP ANALYSIS TOOL FINDREP(1)
NAME
findrep - find exact repeated subsequences in a masterfile
SYNOPSIS
findrep [-v] [-s rep_size] [-h sub_size] [-C]
[-S strand] [-B contigs] [-H] [-G] [-Q] masterfile
DESCRIPTION
findrep analyses a masterfile (that is, a file in mf(5) format) and
reports occurrences of identical repeated subsequences whose length
is greater than or equal to some value (by default, 100 bases).
OPTIONS
-s rep_size
rep_size is the minimum size of the repeated subsequences to report;
by default it is 100 bases.
-h sub_size
sub_size is an internal value specific to the length of the
subsequences used during the hashing part of the processing.
Its default value is the same as rep_size (as specified with
the -s command line option). sub_size cannot have a value greater
than rep_size. The memory used by the program increases roughly
with the product sub_size * size_of_masterfile (where
size_of_masterfile is the total number of bases in the masterfile).
Therefore, for large genomes, it can be more memory efficient to
reduce sub_size to less than rep_size, at the cost of a little
more processing time during the hashing.
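The trade-off can be illustrated with a minimal Python sketch of a
seed-and-extend search. This is NOT findrep's actual implementation,
only an assumption about how hashing on sub_size windows might work;
the function name find_repeats and its 0-based positions are
hypothetical. Every window of sub_size bases is stored as a hash key
(hence memory growing with sub_size times the sequence length), and
seeds shorter than rep_size need extra extension work.

```python
from collections import defaultdict

def find_repeats(seq, rep_size, sub_size=None):
    """Return sorted (i, j, length) tuples for maximal exact forward
    repeats of at least rep_size bases, with i < j (0-based starts).
    Hypothetical sketch; overlapping matches are not filtered out."""
    if sub_size is None:
        sub_size = rep_size          # same default as described for -h
    if sub_size > rep_size:
        raise ValueError("sub_size cannot be greater than rep_size")

    # Hashing phase: one key of sub_size bases per start position.
    index = defaultdict(list)
    for pos in range(len(seq) - sub_size + 1):
        index[seq[pos:pos + sub_size]].append(pos)

    found = set()
    for positions in index.values():
        for a in range(len(positions)):
            for b in range(a + 1, len(positions)):
                i, j = positions[a], positions[b]
                # Extend the seed to the left while the bases agree.
                while i > 0 and seq[i - 1] == seq[j - 1]:
                    i -= 1
                    j -= 1
                length = positions[a] - i + sub_size
                # Extend the seed to the right while the bases agree.
                while j + length < len(seq) and seq[i + length] == seq[j + length]:
                    length += 1
                if length >= rep_size:
                    found.add((i, j, length))
    return sorted(found)
```

In this sketch, a smaller sub_size produces shorter keys but more seed
pairs to extend, which mirrors the memory versus time trade-off
described above.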
-v Verbose option. Prints more information during the processing
phase; use this when you want to see how quickly (or slowly)
the analysis proceeds on a particular platform. For even more
info, try -v2 (no space between the v and the 2).
-C Coordinates file format. The output will be in coordinates(5)
file format rather than FASTA.
-S strand
Restrict the strandedness of the search. The argument strand
can be one of the following keywords:
- "forward", "noncomp" for the forward strand;
- "reverse", "comp" for the reverse strand;
- "both", "any" for both strands.
Only the first letter of the keyword is examined;
thus the option "-S both" can be written as "-S b" or "-Sb".
Restricting the strandedness to a single strand roughly halves the
amount of memory required during processing. The default behavior
is to search both strands.
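The first-letter rule can be sketched in Python as follows. This is an
illustration only, not findrep's source; the function name
parse_strand is hypothetical, and treating the letter as
case-sensitive is an assumption.

```python
# First letter of each documented keyword -> strand restriction.
_STRAND_BY_LETTER = {
    "f": "forward",  # "forward"
    "n": "forward",  # "noncomp"
    "r": "reverse",  # "reverse"
    "c": "reverse",  # "comp"
    "b": "both",     # "both"
    "a": "both",     # "any"
}

def parse_strand(keyword):
    """Map a -S argument to 'forward', 'reverse' or 'both' using only
    its first letter (hypothetical helper, case-sensitive by assumption)."""
    try:
        return _STRAND_BY_LETTER[keyword[:1]]
    except KeyError:
        raise ValueError("unknown strand keyword: %r" % keyword)
```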
-B contigs
Restrict the reporting of the repeats. You can ignore repeats
that are in the same contig or ignore repeats that are in
different contigs. The argument contigs can be one of the
following keywords:
- "any" allow repeats between any contigs.
- "same" only allow repeats in the same contig.
- "different" only allow repeats in different contigs.
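As an illustration, the three keywords amount to a simple filter on
the pair of contigs in which a repeat was found. The Python sketch
below is an assumption about the logic, not findrep's code; the name
keep_repeat and its arguments are hypothetical.

```python
def keep_repeat(mode, contig_a, contig_b):
    """Decide whether a repeat between contig_a and contig_b is
    reported under the given -B keyword (hypothetical helper)."""
    if mode == "any":
        return True
    if mode == "same":
        return contig_a == contig_b
    if mode == "different":
        return contig_a != contig_b
    raise ValueError("unknown contigs keyword: %r" % mode)
```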
-H Reports only hairpin repeats (subsequences like ATACTGCAGTAT which
are their own repeat when reversed and complemented). This
option forces the options "-S both" and "-B same".
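The hairpin property can be checked with a short Python sketch. The
helper names reverse_complement and is_hairpin are hypothetical and
only uppercase A, C, G and T are handled; this is an illustration of
the definition, not of how findrep detects hairpins.

```python
# Translation table for complementing uppercase DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def is_hairpin(seq):
    """True when seq is identical to its own reverse complement."""
    return seq == reverse_complement(seq)
```

With these definitions, is_hairpin("ATACTGCAGTAT") is true for the
example sequence given above.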
-G Graphical option. In the reports, a scaled-down representation of
the contigs with the location of the repeats will be inserted.
See the OUTPUT section for more details.
-Q Quiet mode. When present, disables ALL output except for a
single integer: the length of the longest repeat whose length is
greater than or equal to rep_size. The program will
output "0" (zero) if no repeat of at least rep_size was found.
OUTPUT
The information displayed by findrep consists of a header section that
identifies the parameters used during invocation, followed by zero, one
or many repeat reports.
A repeat report typically consists of 6 parts, numbered (1) to (6) in
the sample report below:
(1) >Repeat 2: forward(304-406) reverse(111-213) (103)
(2) ;;> R.youthere unknown reading frame #1
(3) ;;> P.terpiper mitochondrial fragment
(4) ;; >>>>>>>>>>>>---->>>>>>>>>>>>>
(5) ;; <<<
(1) >Repeat 2: forward(304-406) reverse(111-213) (103)
This is a fasta header; repeats are numbered sequentially as they
are printed out. The "forward(304-406)" string identifies the first
subsequence match for this repeat; "forward" means that this
match occurred in the contig as read by the program in the
masterfile. The "reverse(111-213)" string identifies the second
subsequence match for this repeat; "reverse" means that this match
occurred in the contig on the strand opposite to the one given by the
masterfile. ALL range positions (like "304-406" and "111-213") are
relative to the beginning of the contig in the direction given by
the masterfile, no matter whether the match is "forward" or
"reverse". The last piece of information, "(103)", is the length
of the repeat.
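For scripts that post-process reports, the header line can be split
into its fields with a regular expression. The Python sketch below
assumes that every header has exactly the layout of the sample line;
the names HEADER_RE and parse_repeat_header are hypothetical and not
part of findrep.

```python
import re

# Pattern modelled on the sample header only (an assumption).
HEADER_RE = re.compile(
    r"^>Repeat (?P<number>\d+): "
    r"(?P<strand1>forward|reverse)\((?P<start1>\d+)-(?P<end1>\d+)\) "
    r"(?P<strand2>forward|reverse)\((?P<start2>\d+)-(?P<end2>\d+)\) "
    r"\((?P<length>\d+)\)$"
)

def parse_repeat_header(line):
    """Return the header fields as a dict; numeric fields become ints."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError("not a repeat header: %r" % line)
    return {
        key: (value if key.startswith("strand") else int(value))
        for key, value in match.groupdict().items()
    }
```

In the sample header, both ranges are consistent with inclusive
coordinates: 406 - 304 + 1 and 213 - 111 + 1 both equal the reported
length of 103.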
(2) ;;> R.youthere unknown reading frame #1
(3) ;;> P.terpiper mitochondrial fragment
These lines identify the contigs in which the repeats were found.
Line (2) is the name of the first contig (in our example, the
one where the "forward(304-406)" match occurred) and line (3)
is the name of the second contig (where the "reverse(111-213)"
match occurred). If the two matches occurred in the same contig,
then line (3) is not displayed in the output.
(4) ;; >>>>>>>>>>>>---->>>>>>>>>>>>>
(5) ;; <<<" signs when the match was found on the forward strand and
with "