Wiki

Function

Needleman-Wunsch global alignment of two sequences

Description

needle reads two input sequences and writes their optimal global sequence alignment to file. It uses the Needleman-Wunsch alignment algorithm to find the optimum alignment (including gaps) of two sequences along their entire length. The algorithm uses a dynamic programming method to ensure the alignment is optimum, by exploring all possible alignments and choosing the best. A scoring matrix is read that contains values for every possible residue or nucleotide match. Needle finds the alignment with the maximum possible score where the score of an alignment is equal to the sum of the matches taken from the scoring matrix, minus penalties arising from opening and extending gaps in the aligned sequences. The substitution matrix and gap opening and extension penalties are user-specified.

Algorithm

The Needleman-Wunsch algorithm is a member of the class of algorithms that can calculate the best score and alignment of two sequences in the order of mn steps, where n and m are the sequence lengths. These dynamic programming algorithms were first developed for protein sequence comparison by Needleman and Wunsch, though similar methods were independently devised during the late 1960's and early 1970's for use in the fields of speech processing and computer science.

An important problem is the treatment of gaps, i.e., spaces inserted to optimise the alignment score. A penalty is subtracted from the score for each gap opened (the 'gap open' penalty) and a penalty is subtracted from the score for the total number of gap spaces multiplied by a cost (the 'gap extension' penalty). Typically, the cost of extending a gap is set to be 5-10 times lower than the cost for opening a gap.

Penalty for a gap of n positions is calculated using the following formula:

gap opening penalty + (n - 1) * gap extension penalty

In a Needleman-Wunsch global alignment, the entire length of each sequence is aligned. The sequences might be partially overlapping or one sequence might be aligned entirely internally to the other. There is no penalty for the hanging ends of the overlap. In bioinformatics, it is usually reasonable to assume that the sequences are incomplete and there should be no penalty for failing to align the missing bases.

Command line arguments

Needleman-Wunsch global alignment of two sequences
Version: EMBOSS:6.3.0
Standard (Mandatory) qualifiers:
[-asequence] sequence Sequence filename and optional format, or
reference (input USA)
[-bsequence] seqall Sequence(s) filename and optional format, or
reference (input USA)
-gapopen float [10.0 for any sequence] The gap open penalty
is the score taken away when a gap is
created. The best value depends on the
choice of comparison matrix. The default
value assumes you are using the EBLOSUM62
matrix for protein sequences, and the
EDNAFULL matrix for nucleotide sequences.
(Floating point number from 1.0 to 100.0)
-gapextend float [0.5 for any sequence] The gap extension,
penalty is added to the standard gap penalty
for each base or residue in the gap. This
is how long gaps are penalized. Usually you
will expect a few long gaps rather than many
short gaps, so the gap extension penalty
should be lower than the gap penalty. An
exception is where one or both sequences are
single reads with possible sequencing
errors in which case you would expect many
single base gaps. You can get this result by
setting the gap open penalty to zero (or
very low) and using the gap extension
penalty to control gap scoring. (Floating
point number from 0.0 to 10.0)
[-outfile] align [*.needle] Output alignment file name
Additional (Optional) qualifiers:
-datafile matrixf [EBLOSUM62 for protein, EDNAFULL for DNA]
This is the scoring matrix file used when
comparing sequences. By default it is the
file 'EBLOSUM62' (for proteins) or the file
'EDNAFULL' (for nucleic sequences). These
files are found in the 'data' directory of
the EMBOSS installation.
-endweight boolean [N] Apply end gap penalties.
-endopen float [10.0 for any sequence] The end gap open
penalty is the score taken away when an end
gap is created. The best value depends on
the choice of comparison matrix. The default
value assumes you are using the EBLOSUM62
matrix for protein sequences, and the
EDNAFULL matrix for nucleotide sequences.
(Floating point number from 1.0 to 100.0)
-endextend float [0.5 for any sequence] The end gap
extension, penalty is added to the end gap
penalty for each base or residue in the end
gap. This is how long end gaps are
penalized. (Floating point number from 0.0
to 10.0)
Advanced (Unprompted) qualifiers:
-[no]brief boolean [Y] Brief identity and similarity
Associated qualifiers:
"-asequence" associated qualifiers
-sbegin1 integer Start of the sequence to be used
-send1 integer End of the sequence to be used
-sreverse1 boolean Reverse (if DNA)
-sask1 boolean Ask for begin/end/reverse
-snucleotide1 boolean Sequence is nucleotide
-sprotein1 boolean Sequence is protein
-slower1 boolean Make lower case
-supper1 boolean Make upper case
-sformat1 string Input sequence format
-sdbname1 string Database name
-sid1 string Entryname
-ufo1 string UFO features
-fformat1 string Features format
-fopenfile1 string Features file name
"-bsequence" associated qualifiers
-sbegin2 integer Start of each sequence to be used
-send2 integer End of each sequence to be used
-sreverse2 boolean Reverse (if DNA)
-sask2 boolean Ask for begin/end/reverse
-snucleotide2 boolean Sequence is nucleotide
-sprotein2 boolean Sequence is protein
-slower2 boolean Make lower case
-supper2 boolean Make upper case
-sformat2 string Input sequence format
-sdbname2 string Database name
-sid2 string Entryname
-ufo2 string UFO features
-fformat2 string Features format
-fopenfile2 string Features file name
"-outfile" associated qualifiers
-aformat3 string Alignment format
-aextension3 string File name extension
-adirectory3 string Output directory
-aname3 string Base file name
-awidth3 integer Alignment width
-aaccshow3 boolean Show accession number in the header
-adesshow3 boolean Show description in the header
-ausashow3 boolean Show the full USA in the alignment
-aglobal3 boolean Show the full sequence in alignment
General qualifiers:
-auto boolean Turn off prompts
-stdout boolean Write first file to standard output
-filter boolean Read first file from standard input, write
first file to standard output
-options boolean Prompt for standard and additional values
-debug boolean Write debug output to program.dbg
-verbose boolean Report some/full command line options
-help boolean Report command line options and exit. More
information on associated and general
qualifiers can be found with -help -verbose
-warning boolean Report warnings
-error boolean Report errors
-fatal boolean Report fatal errors
-die boolean Report dying program messages
-version boolean Report version number and exit

Qualifier

Type

Description

Allowed values

Default

Standard (Mandatory) qualifiers

[-asequence](Parameter 1)

sequence

Sequence filename and optional format, or reference (input USA)

Readable sequence

Required

[-bsequence](Parameter 2)

seqall

Sequence(s) filename and optional format, or reference (input USA)

Readable sequence(s)

Required

-gapopen

float

The gap open penalty is the score taken away when a gap is created. The best value depends on the choice of comparison matrix. The default value assumes you are using the EBLOSUM62 matrix for protein sequences, and the EDNAFULL matrix for nucleotide sequences.

Floating point number from 1.0 to 100.0

10.0 for any sequence

-gapextend

float

The gap extension, penalty is added to the standard gap penalty for each base or residue in the gap. This is how long gaps are penalized. Usually you will expect a few long gaps rather than many short gaps, so the gap extension penalty should be lower than the gap penalty. An exception is where one or both sequences are single reads with possible sequencing errors in which case you would expect many single base gaps. You can get this result by setting the gap open penalty to zero (or very low) and using the gap extension penalty to control gap scoring.

Floating point number from 0.0 to 10.0

0.5 for any sequence

[-outfile](Parameter 3)

align

Output alignment file name

Alignment output file

<*>.needle

Additional (Optional) qualifiers

-datafile

matrixf

This is the scoring matrix file used when comparing sequences. By default it is the file 'EBLOSUM62' (for proteins) or the file 'EDNAFULL' (for nucleic sequences). These files are found in the 'data' directory of the EMBOSS installation.

Comparison matrix file in EMBOSS data path

EBLOSUM62 for proteinEDNAFULL for DNA

-endweight

boolean

Apply end gap penalties.

Boolean value Yes/No

No

-endopen

float

The end gap open penalty is the score taken away when an end gap is created. The best value depends on the choice of comparison matrix. The default value assumes you are using the EBLOSUM62 matrix for protein sequences, and the EDNAFULL matrix for nucleotide sequences.

Floating point number from 1.0 to 100.0

10.0 for any sequence

-endextend

float

The end gap extension, penalty is added to the end gap penalty for each base or residue in the end gap. This is how long end gaps are penalized.

Floating point number from 0.0 to 10.0

0.5 for any sequence

Advanced (Unprompted) qualifiers

-[no]brief

boolean

Brief identity and similarity

Boolean value Yes/No

Yes

Associated qualifiers

"-asequence" associated sequence qualifiers

-sbegin1-sbegin_asequence

integer

Start of the sequence to be used

Any integer value

0

-send1-send_asequence

integer

End of the sequence to be used

Any integer value

0

-sreverse1-sreverse_asequence

boolean

Reverse (if DNA)

Boolean value Yes/No

N

-sask1-sask_asequence

boolean

Ask for begin/end/reverse

Boolean value Yes/No

N

-snucleotide1-snucleotide_asequence

boolean

Sequence is nucleotide

Boolean value Yes/No

N

-sprotein1-sprotein_asequence

boolean

Sequence is protein

Boolean value Yes/No

N

-slower1-slower_asequence

boolean

Make lower case

Boolean value Yes/No

N

-supper1-supper_asequence

boolean

Make upper case

Boolean value Yes/No

N

-sformat1-sformat_asequence

string

Input sequence format

Any string

-sdbname1-sdbname_asequence

string

Database name

Any string

-sid1-sid_asequence

string

Entryname

Any string

-ufo1-ufo_asequence

string

UFO features

Any string

-fformat1-fformat_asequence

string

Features format

Any string

-fopenfile1-fopenfile_asequence

string

Features file name

Any string

"-bsequence" associated seqall qualifiers

-sbegin2-sbegin_bsequence

integer

Start of each sequence to be used

Any integer value

0

-send2-send_bsequence

integer

End of each sequence to be used

Any integer value

0

-sreverse2-sreverse_bsequence

boolean

Reverse (if DNA)

Boolean value Yes/No

N

-sask2-sask_bsequence

boolean

Ask for begin/end/reverse

Boolean value Yes/No

N

-snucleotide2-snucleotide_bsequence

boolean

Sequence is nucleotide

Boolean value Yes/No

N

-sprotein2-sprotein_bsequence

boolean

Sequence is protein

Boolean value Yes/No

N

-slower2-slower_bsequence

boolean

Make lower case

Boolean value Yes/No

N

-supper2-supper_bsequence

boolean

Make upper case

Boolean value Yes/No

N

-sformat2-sformat_bsequence

string

Input sequence format

Any string

-sdbname2-sdbname_bsequence

string

Database name

Any string

-sid2-sid_bsequence

string

Entryname

Any string

-ufo2-ufo_bsequence

string

UFO features

Any string

-fformat2-fformat_bsequence

string

Features format

Any string

-fopenfile2-fopenfile_bsequence

string

Features file name

Any string

"-outfile" associated align qualifiers

-aformat3-aformat_outfile

string

Alignment format

Any string

srspair

-aextension3-aextension_outfile

string

File name extension

Any string

-adirectory3-adirectory_outfile

string

Output directory

Any string

-aname3-aname_outfile

string

Base file name

Any string

-awidth3-awidth_outfile

integer

Alignment width

Any integer value

0

-aaccshow3-aaccshow_outfile

boolean

Show accession number in the header

Boolean value Yes/No

N

-adesshow3-adesshow_outfile

boolean

Show description in the header

Boolean value Yes/No

N

-ausashow3-ausashow_outfile

boolean

Show the full USA in the alignment

Boolean value Yes/No

N

-aglobal3-aglobal_outfile

boolean

Show the full sequence in alignment

Boolean value Yes/No

Y

General qualifiers

-auto

boolean

Turn off prompts

Boolean value Yes/No

N

-stdout

boolean

Write first file to standard output

Boolean value Yes/No

N

-filter

boolean

Read first file from standard input, write first file to standard output

Boolean value Yes/No

N

-options

boolean

Prompt for standard and additional values

Boolean value Yes/No

N

-debug

boolean

Write debug output to program.dbg

Boolean value Yes/No

N

-verbose

boolean

Report some/full command line options

Boolean value Yes/No

Y

-help

boolean

Report command line options and exit. More information on associated and general qualifiers can be found with -help -verbose

Boolean value Yes/No

N

-warning

boolean

Report warnings

Boolean value Yes/No

Y

-error

boolean

Report errors

Boolean value Yes/No

Y

-fatal

boolean

Report fatal errors

Boolean value Yes/No

Y

-die

boolean

Report dying program messages

Boolean value Yes/No

Y

-version

boolean

Report version number and exit

Boolean value Yes/No

N

Input file format

needle reads any 2 sequence USAs of the same type (DNA or protein).

Input files for usage example

'tsw:hba_human' is a sequence entry in the example protein database 'tsw'

Output file format

The output is a standard EMBOSS alignment file.

The results can be output in one of several styles by using the
command-line qualifier -aformat xxx, where 'xxx' is replaced by
the name of the required format. Some of the alignment formats can cope
with an unlimited number of sequences, while others are only for pairs
of sequences.

The Identity: is the percentage of identical matches between the two
sequences over the reported aligned region (including any gaps in the length).

The Similarity: is the percentage of matches between the two
sequences over the reported aligned region (including any gaps in the length).

Data files

For protein sequences EBLOSUM62 is used for the substitution
matrix. For nucleotide sequence, EDNAFULL is used. Others can be
specified.

EMBOSS data files are distributed with the application and stored
in the standard EMBOSS data directory, which is defined
by the EMBOSS environment variable EMBOSS_DATA.

To see the available EMBOSS data files, run:

% embossdata -showall

To fetch one of the data files (for example 'Exxx.dat') into your
current directory for you to inspect or modify, run:

% embossdata -fetch -file Exxx.dat

Users can provide their own data files in their own directories.
Project specific files can be put in the current directory, or for
tidier directory listings in a subdirectory called
".embossdata". Files for all EMBOSS runs can be put in the user's home
directory, or again in a subdirectory called ".embossdata".

The directories are searched in the following order:

. (your current directory)

.embossdata (under your current directory)

~/ (your home directory)

~/.embossdata

Notes

needle is a true implementation of the Needleman-Wunsch
algorithm and so produces a full path matrix. It therefore cannot be
used with genome sized sequences unless you've a lot of memory and a
lot of time.

Warnings

needle is for aligning two sequences over their entire length. This
works best with closely related sequences. If you use needle to align
very distantly-related sequences, it will produce a result but much of
the alignment may have little or no biological significance.

A true Needleman Wunsch implementation like needle needs memory
proportional to the product of the sequence lengths. For two sequences
of length 10,000,000 and 1,000 it therefore needs memory proportional to
10,000,000,000 characters. Two arrays of this size are produced, one of
ints and one of floats so multiply that figure by 8 to get the memory
usage in bytes. That doesn't include other overheads. Therefore only
use water and needle for accurate alignment of reasonably short
sequences.

If you run out of memory, try using
stretcher instead.

Diagnostic Error Messages

Uncaught exception
Assertion failed
raised at ajmem.c:xxx

Probably means you have run out of memory. Try using
stretcher if this happens.