Function

Draw a threshold dotplot of two sequences

Description

dotmatcher generates a dotplot from two input sequences. The dotplot is an intuitive graphical representation of the regions of similarity between two sequences. All positions from the first input sequence are compared with all positions from the second input sequence using a specified substitution matrix. The two sequences are the axes of the rectangular dotplot. Wherever there is "similarity" between a position from each sequence a dot is plotted. The threshold conditions for "similarity" are defined by the user.

Algorithm

All positions from the first input sequence are compared with all positions from the second input sequence and scored, using the specified substitution matrix. This produces a matrix of scores from which local regions of similarity (corresponding to diagonals in the dotplot) are identified. A window of user-specified length is moved along all possible diagonals. Each position in the window corresponds to a pair-wise score from the scoring matrix. The score for the entire window is the sum of the scores for individual positions within it. If the window score is above the user-defined threshold, then a line is plotted on the dotplot corresponding to the window.
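
As a rough illustration of the windowing step described above, the following Python sketch slides a window along every diagonal and reports the windows whose summed score is above the threshold. It is not dotmatcher's own implementation: `window_hits` and `toy_score` are hypothetical names, the +5/-4 scoring function is only a stand-in for a real substitution matrix, and treating "above" as strictly greater than is an assumption.

```python
def window_hits(seq_a, seq_b, score, windowsize=10, threshold=23):
    """Return (i, j) start positions of windows whose summed score
    is above the threshold. i indexes seq_a, j indexes seq_b."""
    hits = []
    for i in range(len(seq_a) - windowsize + 1):
        for j in range(len(seq_b) - windowsize + 1):
            # Sum the pairwise scores along the diagonal starting at (i, j).
            total = sum(score(seq_a[i + k], seq_b[j + k])
                        for k in range(windowsize))
            if total > threshold:  # assumption: strictly above
                hits.append((i, j))
    return hits


def toy_score(x, y):
    # Hypothetical stand-in for a substitution matrix lookup.
    return 5 if x == y else -4


hits = window_hits("GATTACA", "CGATTAC", toy_score, windowsize=4, threshold=15)
print(hits)  # [(0, 1), (1, 2), (2, 3)]
```

All three hits lie on the same diagonal (j - i = 1), so in a dotplot they would merge into a single diagonal line. The sketch recomputes each window from scratch for clarity; a running sum along each diagonal would avoid the repeated work.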

This is the scoring matrix file used when comparing sequences. By default it is the file 'EBLOSUM62' (for proteins) or the file 'EDNAFULL' (for nucleic acid sequences). These files are found in the 'data' directory of the EMBOSS installation.

Comparison matrix file in EMBOSS data path

EBLOSUM62 for protein, EDNAFULL for DNA

-windowsize

integer

Window size over which to test threshold

Integer 3 or more

10

-threshold

integer

Threshold

Integer 0 or more

23

Advanced (Unprompted) qualifiers

-stretch

toggle

Display a non-proportional graph

Toggle value Yes/No

No

Associated qualifiers

"-asequence" associated sequence qualifiers

-sbegin1 / -sbegin_asequence

integer

Start of the sequence to be used

Any integer value

0

-send1 / -send_asequence

integer

End of the sequence to be used

Any integer value

0

-sreverse1 / -sreverse_asequence

boolean

Reverse (if DNA)

Boolean value Yes/No

N

-sask1 / -sask_asequence

boolean

Ask for begin/end/reverse

Boolean value Yes/No

N

-snucleotide1 / -snucleotide_asequence

boolean

Sequence is nucleotide

Boolean value Yes/No

N

-sprotein1 / -sprotein_asequence

boolean

Sequence is protein

Boolean value Yes/No

N

-slower1 / -slower_asequence

boolean

Make lower case

Boolean value Yes/No

N

-supper1 / -supper_asequence

boolean

Make upper case

Boolean value Yes/No

N

-scircular1 / -scircular_asequence

boolean

Sequence is circular

Boolean value Yes/No

N

-squick1 / -squick_asequence

boolean

Read id and sequence only

Boolean value Yes/No

N

-sformat1 / -sformat_asequence

string

Input sequence format

Any string

-iquery1 / -iquery_asequence

string

Input query fields or ID list

Any string

-ioffset1 / -ioffset_asequence

integer

Input start position offset

Any integer value

0

-sdbname1 / -sdbname_asequence

string

Database name

Any string

-sid1 / -sid_asequence

string

Entryname

Any string

-ufo1 / -ufo_asequence

string

UFO features

Any string

-fformat1 / -fformat_asequence

string

Features format

Any string

-fopenfile1 / -fopenfile_asequence

string

Features file name

Any string

"-bsequence" associated sequence qualifiers

-sbegin2 / -sbegin_bsequence

integer

Start of the sequence to be used

Any integer value

0

-send2 / -send_bsequence

integer

End of the sequence to be used

Any integer value

0

-sreverse2 / -sreverse_bsequence

boolean

Reverse (if DNA)

Boolean value Yes/No

N

-sask2 / -sask_bsequence

boolean

Ask for begin/end/reverse

Boolean value Yes/No

N

-snucleotide2 / -snucleotide_bsequence

boolean

Sequence is nucleotide

Boolean value Yes/No

N

-sprotein2 / -sprotein_bsequence

boolean

Sequence is protein

Boolean value Yes/No

N

-slower2 / -slower_bsequence

boolean

Make lower case

Boolean value Yes/No

N

-supper2 / -supper_bsequence

boolean

Make upper case

Boolean value Yes/No

N

-scircular2 / -scircular_bsequence

boolean

Sequence is circular

Boolean value Yes/No

N

-squick2 / -squick_bsequence

boolean

Read id and sequence only

Boolean value Yes/No

N

-sformat2 / -sformat_bsequence

string

Input sequence format

Any string

-iquery2 / -iquery_bsequence

string

Input query fields or ID list

Any string

-ioffset2 / -ioffset_bsequence

integer

Input start position offset

Any integer value

0

-sdbname2 / -sdbname_bsequence

string

Database name

Any string

-sid2 / -sid_bsequence

string

Entryname

Any string

-ufo2 / -ufo_bsequence

string

UFO features

Any string

-fformat2 / -fformat_bsequence

string

Features format

Any string

-fopenfile2 / -fopenfile_bsequence

string

Features file name

Any string

"-graph" associated graph qualifiers

-gprompt

boolean

Graph prompting

Boolean value Yes/No

N

-gdesc

string

Graph description

Any string

-gtitle

string

Graph title

Any string

Dotmatcher: $(asequence.usa) vs $(bsequence.usa)

-gsubtitle

string

Graph subtitle

Any string

-gxtitle

string

Graph x axis title

Any string

$(asequence.name)

-gytitle

string

Graph y axis title

Any string

$(bsequence.name)

-goutfile

string

Output file for non interactive displays

Any string

-gdirectory

string

Output directory

Any string

"-xygraph" associated xygraph qualifiers

-gprompt

boolean

Graph prompting

Boolean value Yes/No

N

-gdesc

string

Graph description

Any string

-gtitle

string

Graph title

Any string

Dotmatcher: $(asequence.usa) vs $(bsequence.usa)

-gsubtitle

string

Graph subtitle

Any string

-gxtitle

string

Graph x axis title

Any string

$(asequence.name)

-gytitle

string

Graph y axis title

Any string

$(bsequence.name)

-goutfile

string

Output file for non interactive displays

Any string

-gdirectory

string

Output directory

Any string

General qualifiers

-auto

boolean

Turn off prompts

Boolean value Yes/No

N

-stdout

boolean

Write first file to standard output

Boolean value Yes/No

N

-filter

boolean

Read first file from standard input, write first file to standard output

Boolean value Yes/No

N

-options

boolean

Prompt for standard and additional values

Boolean value Yes/No

N

-debug

boolean

Write debug output to program.dbg

Boolean value Yes/No

N

-verbose

boolean

Report some/full command line options

Boolean value Yes/No

Y

-help

boolean

Report command line options and exit. More information on associated and general qualifiers can be found with -help -verbose

Boolean value Yes/No

N

-warning

boolean

Report warnings

Boolean value Yes/No

Y

-error

boolean

Report errors

Boolean value Yes/No

Y

-fatal

boolean

Report fatal errors

Boolean value Yes/No

Y

-die

boolean

Report dying program messages

Boolean value Yes/No

Y

-version

boolean

Report version number and exit

Boolean value Yes/No

N
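
The qualifiers listed above can be combined into a single non-interactive command line. The Python sketch below only assembles the argument list and checks the documented limits (window size of 3 or more, threshold of 0 or more); it does not run anything. The helper name `build_dotmatcher_args`, the input file names, and the choice of `ps` as the graphics device are assumptions made for illustration.

```python
def build_dotmatcher_args(asequence, bsequence, windowsize=10, threshold=23,
                          graph="ps", goutfile=None):
    """Assemble a dotmatcher argument list from documented qualifiers."""
    if windowsize < 3:
        raise ValueError("windowsize must be an integer of 3 or more")
    if threshold < 0:
        raise ValueError("threshold must be an integer of 0 or more")
    args = [
        "dotmatcher",
        "-asequence", asequence,
        "-bsequence", bsequence,
        "-windowsize", str(windowsize),
        "-threshold", str(threshold),
        "-graph", graph,   # assumption: 'ps' device is available
        "-auto",           # turn off prompts
    ]
    if goutfile is not None:
        args += ["-goutfile", goutfile]
    return args


# Hypothetical input files; defaults are windowsize 10 and threshold 23.
print(" ".join(build_dotmatcher_args("first.fasta", "second.fasta")))
```

The resulting list could be handed to `subprocess.run` on a system where EMBOSS is installed. Keeping the validation in one place mirrors the allowed values given in the table.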

Input file format

dotmatcher reads two nucleotide or protein sequences.

The input is a standard EMBOSS sequence query (also known as a 'USA').

Major sequence database sources defined as standard in EMBOSS
installations include srs:embl, srs:uniprot and ensembl.

Data can also be read from sequence output in any supported format
written by an EMBOSS or third-party application.

The input format can be specified by using the
command-line qualifier -sformat xxx, where 'xxx' is replaced
by the name of the required format. The available format names are:
gff (gff3), gff2, embl (em), genbank (gb, refseq), ddbj, refseqp, pir
(nbrf), swissprot (swiss, sw), dasgff and debug.

Output file format

Output is written to the specified graphics device.

The results can be output in one of several formats by using the
command-line qualifier -graph xxx, where 'xxx' is replaced by
the name of the required device. Support depends on the availability
of third-party software packages.

Output files for usage example

Graphics File: dotmatcher.ps

Data files

dotmatcher uses the specified substitution matrix file to compare the two sequences.

For protein sequences, EBLOSUM62 is used as the substitution
matrix. For nucleotide sequences, EDNAFULL is used. Other matrices
can be specified.

EMBOSS data files are distributed with the application and stored
in the standard EMBOSS data directory, which is defined
by the EMBOSS environment variable EMBOSS_DATA.

To see the available EMBOSS data files, run:

% embossdata -showall

To fetch one of the data files (for example 'Exxx.dat') into your
current directory for you to inspect or modify, run:

% embossdata -fetch -file Exxx.dat

Users can provide their own data files in their own directories.
Project-specific files can be put in the current directory or, for
tidier directory listings, in a subdirectory called
".embossdata". Files for all EMBOSS runs can be put in the user's home
directory, or again in a subdirectory called ".embossdata".

The directories are searched in the following order:

. (your current directory)

.embossdata (under your current directory)

~/ (your home directory)

~/.embossdata
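
The search order above can be pictured with a small Python sketch. It is a model of the documented order only, not EMBOSS code: `candidate_dirs` and `find_data_file` are hypothetical helpers, and returning `None` simply stands for "not found in any of these four places"; whatever EMBOSS does after that is not modelled here.

```python
from pathlib import Path


def candidate_dirs(cwd, home):
    """Directories in the documented search order."""
    cwd, home = Path(cwd), Path(home)
    return [
        cwd,                   # . (current directory)
        cwd / ".embossdata",   # .embossdata under the current directory
        home,                  # ~/ (home directory)
        home / ".embossdata",  # ~/.embossdata
    ]


def find_data_file(name, cwd, home):
    """Return the first existing copy of a data file, or None."""
    for directory in candidate_dirs(cwd, home):
        path = directory / name
        if path.is_file():
            return path
    return None


# Example: look for a locally modified matrix file.
print(find_data_file("EBLOSUM62", Path.cwd(), Path.home()))
```

Because the first match wins, a file in the current directory shadows one of the same name in the home directory.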

Notes

Where the two sequences have substantial regions of similarity, the dots line up to form diagonal lines, so such local regions of similarity can be seen at a glance. Other features are also easy to see, such as repeats (which form parallel diagonal lines) and insertions or deletions (which form breaks or discontinuities in the diagonal lines).
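
To make the repeat and insertion/deletion patterns concrete, the Python sketch below counts identical positions on each diagonal, where a diagonal is identified by the offset j - i between the two sequence positions. It uses plain identity rather than a substitution matrix and no window or threshold, so it is a simplification of what dotmatcher plots; `diagonal_match_counts` and the short example sequences are invented for illustration.

```python
from collections import Counter


def diagonal_match_counts(seq_a, seq_b):
    """Count identical positions per diagonal (offset = j - i)."""
    counts = Counter()
    for i, x in enumerate(seq_a):
        for j, y in enumerate(seq_b):
            if x == y:
                counts[j - i] += 1
    return counts


# A tandem repeat compared with itself: the main diagonal plus two
# parallel diagonals offset by the repeat length.
print(dict(diagonal_match_counts("MKVLAMKVLA", "MKVLAMKVLA")))
# {0: 10, 5: 5, -5: 5}

# A single deleted residue: the matches shift from diagonal 0 to
# diagonal -1, which shows up as a break in the line.
print(dict(diagonal_match_counts("MKVLAGHW", "MKVLGHW")))
# {0: 4, -1: 3}
```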