FUNCTION

FastA does a Pearson and
Lipman search for similarity between a query sequence and a group of sequences
of the same type (nucleic acid or protein). For nucleotide searches, FastA may
be more sensitive than BLAST.

DESCRIPTION

FastA uses the method of Pearson
and Lipman (Proc. Natl. Acad. Sci. USA 85; 2444-2448 (1988)) to search
for similarities between one sequence (the query) and any group of
sequences of the same type (nucleic acid or protein) as the query sequence.

In the first step of this search, the comparison can be viewed as a set of
dot plots, with the query as the vertical sequence and the group of sequences
to which the query is being compared as the different horizontal sequences.
This first step finds the registers of comparison (diagonals) having the
largest number of short perfect matches (words) for each comparison. In
the second step, these "best" regions are rescored using a scoring
matrix that allows conservative replacements, ambiguity symbols, and runs of
identities shorter than the size of a word. In the third step, the program
checks to see if some of these initial highest-scoring diagonals can be joined
together. Finally, the search set sequences with the highest scores are aligned
to the query sequence for display.

What is a Word?

A word is any short sequence (n-mer or k-tuple) where you have set n to some
small integer less than or equal to six. The word GGATGG is one of the 4,096
possible words of length six that can be created from an alphabet consisting of
the four letters G, A, T, and C. The word QL is one of the 400 possible words
of length two that you can make with the 20 letters of the amino acid alphabet.

OUTPUT

The output from FastA is a list file,
and is suitable for input to any GCG program that allows indirect file
specifications. (For information about indirect file specification, see Section
2, Using Sequence Files and Databases of the User's Guide.)

The first part of the output file contains a histogram showing the distribution
of the z-scores between the query and search set sequences. (See the
ALGORITHM topic for an explanation of z-score.) The histogram is
composed of bins of size 2 that are labeled according to the higher score for
that bin (the leftmost column of the histogram). For example, the bin labeled
24 stores the number of sequence pairs that had scores of 23 or 24.

The
next two columns of the histogram list the number of z-scores that fell
within each bin. The second column lists the number of z-scores observed
in the search and the third column lists the number of z-scores that were
expected.

The
body of the histogram displays a graphical representation of the score
distributions. Equal signs (=) indicate the number of scores of that magnitude
that were observed during the search, while asterisks (*) plot the number of
scores of that magnitude that were expected.

At
the bottom of the histogram is a list of some of the parameters pertaining to
the search.

Below
the histogram, FastA displays a listing of the best scores.Strand:-
after the sequence name in this list indicates that the match was found between
the search set sequence and the reverse complement of the query sequence.

Following
the list of best scores, FastA displays the alignments of the regions of best
overlap between the query and search sequences.
/rev following the query sequence name indicates that the search
sequence is aligned with the reverse complement of the query sequence.

This
program displays only the region of overlap between the two aligned sequences
(plus some residues on either side of the region to provide context for the
alignment) unless you use -SHOWall. The display of identities and
conservative replacements between the aligned sequences depends on the value of -MARKx.
By default ( -MARKx=3), the pipe character (|) is used
to denote identities and the colon (:) to denote conservative replacements.

INPUT FILES

FastA accepts a single protein
sequence or a single nucleic acid sequence as the query sequence. The search
set is either a single sequence or multiple sequences of the same type as the
query. You can specify multiple sequences in a number of ways: by using a list
file, for example @project.list;
by using an MSF or RSF file, for example
project.msf{*}; or by using a sequence specification with an
asterisk (*) wildcard, for
example GenBank:*. The function
of FastA depends on whether your input sequence(s) are protein or nucleotide.
Programs determine the type of a sequence by the presence of either Type: N or Type: P on the last line of the text heading just above
the sequence. If your sequence(s) are not the correct type, turn to Appendix VI for information on how to change or set
the type of a sequence.

RELATED PROGRAMS

FastA+
does a Pearson and Lipman search for similarity between a query sequence and a
group of sequences of the same type (nucleic acid or protein). For nucleotide
searches, FastA+ may be more sensitive than BLAST+.

BLAST searches one or more nucleic acid or protein
databases for sequences similar to one or more query sequences of any type.
BLASTcan produce gapped alignments for the
matches it finds. NetBLAST searches for sequences similar to a query sequence.
The query and the database searched can be either peptide or nucleic acid in
any combination. NetBLAST can search only databases
maintained at the NationalCenter
for Biotechnology Information (NCBI) in Bethesda,
Maryland, USA.

SSearch does a rigorous Smith-Waterman search for
similarity between a query sequence and a group of sequences of the same type
(nucleic acid or protein). This may be the most sensitive method available for
similarity searches. Compared to BLAST and FastA, it
can be very slow.

TFastA does a Pearson and Lipman search for similarity
between a protein query sequence and any group of nucleotide sequences. TFastAtranslates the nucleotide sequences in all six
reading frames before performing the comparison. It is designed to answer the
question, "What implied protein sequences in a nucleotide sequence
database are similar to my protein sequence?"

TFastX does a Pearson and Lipman search for similarity
between a protein query sequence and any group of nucleotide sequences, taking
frameshifts into account. It is designed to be a replacement for TFastA, and
like TFastA, it is designed to answer the question,
"What implied protein sequences in a nucleotide sequence database are
similar to my protein sequence?"

FastX does a Pearson and Lipman search for similarity
between a nucleotide query sequence and a group of protein sequences, taking
frameshifts into account. FastXtranslates
both strands of the nucleic sequence before performing the comparison. It is
designed to answer the question, "What implied protein sequences in my
nucleic acid sequence are similar to sequences in a protein database?"

FrameSearch searches a group of protein sequences
for similarity to one or more nucleotide query sequences, or searches a group
of nucleotide sequences for similarity to one or more protein query sequences.
For each sequence comparison, the program finds an optimal alignment between
the protein sequence and all possible codons on each strand of the nucleotide
sequence. Optimal alignments may include reading frame shifts.

WordSearch identifies sequences in the database that
share large numbers of common words in the same register of comparison with
your query sequence. The output of WordSearch can
be displayed with Segments.

ProfileSearch and MotifSearch
use a profile (derived from a set of aligned sequences) instead of a
query sequence to search a collection of sequences. FindPatterns
uses a pattern described by a regular expression to search a collection
of sequences.

BLAST+ searches one or more nucleic acid or protein
databases for sequences similar to one or more query sequences of any type. BLASTcan produce gapped alignments for the matches it
finds. NetBLAST searches for sequences similar to a query sequence. The query
and the database searched can be either peptide or nucleic acid in any
combination. NetBLAST can search only databases maintained at the NationalCenter for Biotechnology
Information (NCBI) in Bethesda, Maryland,
USA.

SSearch+ does a rigorous Smith-Waterman search for
similarity between a query sequence and a group of sequences of the same type
(nucleic acid or protein). This may be the most sensitive method available for
similarity searches. Compared to BLAST and FastA, it can be very slow.

TFastA+ does a Pearson and Lipman search for similarity
between a protein query sequence and any group of nucleotide sequences. TFastAtranslates the nucleotide sequences in all six
reading frames before performing the comparison. It is designed to answer the
question, "What implied protein sequences in a nucleotide sequence database
are similar to my protein sequence?"

TFastX+ does a Pearson and Lipman search for similarity
between a protein query sequence and any group of nucleotide sequences, taking
frameshifts into account. It is designed to be a replacement for TFastA, and
like TFastA, it is designed to answer the question, "What implied protein
sequences in a nucleotide sequence database are similar to my protein
sequence?"

FastX+ does a Pearson and Lipman search for similarity
between a nucleotide query sequence and a group of protein sequences, taking
frameshifts into account. FastXtranslates
both strands of the nucleic sequence before performing the comparison. It is
designed to answer the question, "What implied protein sequences in my
nucleic acid sequence are similar to sequences in a protein database?"

RESTRICTIONS

The query sequence cannot be longer
than 32,000 symbols. You cannot select a list size of more than 1,000 best
scores nor view more than 1,000 alignments. The word size must be from 1 to 6
for nucleic acid queries, and from 1 to 2 for protein queries. The sequence
type (nucleic acid or protein) of the query sequence and the search set
sequences must match.

For
the estimates of statistical significance to be valid, the search set must
contain a large sample of unrelated sequences. The statistical estimates will
not be calculated at all if there are fewer 10 sequences in the search set (20
sequences if only one strand is searched).

With -NOOPTall, the estimates of statistical significance will not be
accurate.

ALGORITHM

FastA uses the method of
Pearson and Lipman (Proc. Natl. Acad. Sci. USA 85; 2444-2448 (1988)) to
search for similarities between one sequence (the query) and any group
of sequences.

Hashing
Step

The first step in the search is to locate regions of the query sequence and the
search set sequence that have high densities of exact word matches. The
algorithm for this step of the search is a modification of the algorithm of
Wilbur and Lipman (Proc. Natl. Acad. Sci. USA 80; 726-730 (1983)) and
may be referred to as a hash-table look-up search or hashing.
Wilbur and Lipman searches (including FastA) belong to a class of comparisons
that use direct addressing or k-tuple preprocessing to increase
the speed of the search at the expense of some sensitivity.

The
hashing process works as follows. After you give FastA a word size, it makes up
a dictionary of all of the possible words of that size in the query sequence. A
second dictionary is created for the opposite strand if the query is a nucleic
acid sequence. Each word, such as GGATGG, is converted to a unique base-4
number that serves as an index to the corresponding dictionary entry. Each
entry contains a list of numbers telling the location (coordinates) of the word
in the query sequence. If the word does not occur, the entry contains only the
number zero. So for each word in the searched sequences, FastA only has to look
up the word in the dictionary to find out if it occurs in the query sequence.

It
is important to realize that the hashing process cannot deal with ambiguity!
To partially compensate for this limitation, FastA converts an ambiguous base
in a sequence to its most common nonambiguous constituent before calculating
the index number of the word that contains the ambiguity. For example, A is the
most common nucleotide in the sequence databases, so N is converted to A during
the hashing step. Similarly, the ambiguous amino acids B, Z, and X are
converted to their most common unambiguous constituent, so B (D or N) gets the
same hash code as N, and X (any amino acid) gets the same hash code as alanine,
the most common amino acid in the protein databases.

If
a word from a search set sequence occurs in the query sequence, FastA computes
a score for the word equal to the sum of the scoring factors (see next
paragraph) for each symbol in the word. It then adds this score to the score of
the diagonal on which the word occurs. If a word match overlaps another word on
the same diagonal, only the scoring factor(s) for the non-overlapping symbol(s)
is added to the score of the diagonal. If there are intervening mismatches
between matching words on a diagonal, a constant penalty for each mismatching
residue is subtracted from the score.

When -PAMfactor
is in effect (the default for protein query sequences), the scoring factors
used to score a word are the identical match scores of the scoring matrix used.
Thus a word that contains relatively immutable amino acids will add a larger
score to the diagonal than a word which contains amino acids which can exchange
readily. The default for a nucleic acid query sequence is -NOPAMfactor. In this case, a single constant value is used for
all symbol matches, so all words contribute the same score. The program
defaults can be overridden using -PAMfactor or -NOPAMfactor.

Scoring
Step

At the end of the hashing step, the ten highest-scoring regions for the
sequence pair (the regions with the highest density of exact word matches) are
rescored using a scoring matrix that allows conservative replacements and runs
of identities shorter than the size of a word. The ends of each region are
trimmed to include only those residues that contribute to the highest score for
the region, resulting in ten partial alignments without gaps. These are
referred to as the initial regions. The score of the highest scoring initial
region is saved as the init1 score.

Next,
FastA determines if any of the initial regions from different diagonals may be
joined together to form an approximate alignment with gaps. Only
non-overlapping regions may be joined. The score for the joined regions is the
sum of the scores of the initial regions minus a joining penalty for each gap. The
score of the highest scoring region at the end of this step is saved as the
initn score.

Aligning
Step

After computing the initial scores, FastA determines the best segment of similarity
between the query sequence and the search set sequence, using a variation of
the Smith-Waterman algorithm. This "local alignment in a band"
procedure is described in Chao, Pearson, and Miller (CABIOS 8; 481-487
(1992)). The score for this alignment is reported as the opt score.

By
default, FastA determines the opt score immediately if the initn score is
greater than a given threshold. The opt scores are then used as the basis for
keeping a list of the best matches found. The program calculates the default
threshold from the length of the query sequence and the ktup setting. You can
override this threshold by adding a positive, nonzero number after -OPTall,
for example -OPTall=20. A threshold of 1 is the most
sensitive setting. Setting the threshold higher than this will speed up the
search, at the risk of missing some matches.

Alternatively,
you can use -NOOPTall to direct the program to use the initn
scores as the basis for retaining the best matches. In this case, the opt
scores are calculated for the matches with the best initn scores only after all
of the search set sequences have been scanned. This speeds up the search, but
at the cost of sensitivity, and the statistical estimates for such a search will
not be valid. When -NOOPTall is specified, the best scores are
sorted and reported in order of their initn scores, even though the opt score
is calculated.

Lastly,
FastA uses a simple linear regression against the natural log of the search set
sequence length to calculate a normalized z-score for the sequence pair.
(See William R. Pearson, Protein Science 4; 1145-1160 (1995) for an
explanation of how this z-score is calculated.) By default, the z-score is
calculated from the opt score; with
-NOOPTall, the
z-score is calculated from the initn score instead.

The
distribution of the z-scores tends to closely approximate an extreme-value
distribution; using this distribution, the program can estimate the number of
sequences that would be expected to have, purely by chance, a z-score greater
than or equal to the z-score obtained in the search. This is reported as the
E() score.

When
all of the search set sequences have been compared to the query, the list of
best scores is printed. If alignments were requested, the alignments are also
printed. For searches with a protein query sequence against a protein search
set, a full Smith-Waterman local alignment (not restricted to a band, and
therefore allowing unlimited gap lengths) is performed, and a Smith-Waterman score
is reported along with the other scores and the alignment itself. (This
alignment may not be the same alignment that the "local alignment in a
band" algorithm used to calculate the opt score during the search.) By
default, the alignment for nucleic acid searches and TFastA
is the same "local alignment in a band" that was performed to
calculate the opt score. With -SWalign, you can make the program perform
the full Smith-Waterman alignment on nucleic acid sequences at the cost of
increased computation time.

In
evaluating the E() scores, the following rules of thumb can be used: for
searches of a protein database of 10,000 sequences, sequences with E() less
than 0.01 are almost always found to be homologous. Sequences with E() between
1 and 10 frequently turn out to be related as well. Optimization is important:
with -NOOPTall, E() overestimates the significance of
the match, so that unrelated nucleic acid sequences frequently have scores less
than 0.0005.

A
detailed description of the FastA algorithm is William R. Pearson, "Rapid
and Sensitive Sequence Comparison with FASTP and FASTA," in Methods in
Enzymology, 183; 63-98, Academic Press, San Diego, California, USA,
1990.

CONSIDERATIONS

Accelrys GCG (GCG) version
of FastA searches using both strands of nucleic acid queries unless you use -ONEstrand.
Dr. Pearson's FASTA searches with one strand only.

The
E() scores are affected by similarities in sequence composition between the
query sequence and the search set sequence. Unrelated sequences may have
"significant" scores because of composition bias.

If
there is a database entry that overlaps your query in several places, but there
are large gaps between the matching regions, only the best overlap appears in
the alignment display.

There
are two ways to control the size of the list of best scores. By default, scores
are listed until a specific E() value is reached. You may set the value in
response to the program prompt or by using
-EXPect;
otherwise the program uses 10.0 for protein searches, 2.0 for nucleic acid
searches. (If you are running the program interactively, it will show no more
than 40 scores initially, and ask if you want to see more scores if there are
any more that are less than the E() value.)

If
you use -LIStsize, the E() value is ignored, and the
program will list the number of scores you requested.

You
can control the number of alignments using
-NOALIgn and -ALIgn.
The program behaves differently depending on whether it is being run
noninteractively (in batch or with -Default on the command line) or
interactively. In the noninteractive case, the program displays the number of
alignments set by -ALIgn. (If this is not present, it shows 40
alignments or the number of scores that were listed, whichever is smaller.) If
you run the program interactively, it displays the list of best scores, then
asks you how many alignments you want to see. (This prompt does not appear if
you use -NOALIgn or
-ALIgn.)

Increasing
Sensitivity By Adjusting Word Size

By default, FastA uses the maximum allowable word size in order to maximize the
speed of the search. But in some situations this may not be sensitive enough to
find matches to your query. In particular, if your query sequence is a short
oligonucleotide or peptide and/or the query contains ambiguous residues, you
may need to use -WORdsize to reduce the word size that is used
during the hashing step. (A smaller word size will increase the sensitivity of
the search at the expense of increasing the amount of CPU time required to run
the program.)

If
FastA finds few or no matches for short query sequences, rerun the search using
a word size of 2 or 3 (for oligonucleotides) or a word size of 1 (for short
peptides). Because of the way ambiguous residues are treated during the
hashing stage of the search, you should not use a word size larger than the
longest run of nonambiguous residues in your query sequence .

Adjusting
Gap Creation and Extension Penalties

Unlike other GCG programs, FastA does not read the default gap creation and gap
extension penalties from the scoring matrix file. It uses default gap creation and
extension penalties that were empirically determined to be appropriate for the
default scoring matrices. If you select a different scoring matrix with -MATRix,
you may need to change the gap penalties. The histogram display gives a
qualitative view of the quality of fit between the actual distribution of
scores and the expected distribution of scores. This information may indicate
whether or not suitable gap creation and extension penalties were used for the
search. When the histogram shows poor agreement between the actual distribution
and the theoretical distribution, you might consider using -GAPweight
and/or -LENgthweight to specify higher gap creation
and extension penalties, respectively. For example, you might increase the gap
creation penalty from 12 to 16 and the gap extension penalty from 2 to 4.

Differences
in Applying Gap Extension Penalties

There are two different philosophies on how to penalize gaps in an alignment. One
way is to penalize a gap by the gap creation penalty plus the extension penalty
times the length of the gap (gapweight +
(lengthweight x gap length)). The other way is to use the gap
creation penalty plus the extension penalty times the gap length excluding
the first residue in the gap (gapweight
+ (lengthweight x (gap length - 1)).

"Native"
GCG programs, such as Framesearch and Bestfit, handle gap extension penalties
the first way, while the FastA-family programs use the second way. Therefore a
value for -LENgthweight that gives good results with one
of the FastA-family programs may not give equivalent results with a native GCG
program, and vice versa.

Increasing
Program Speed Using Multithreading

This program is multithreaded. It has the potential to run faster on a machine
equipped with multiple processors because different parts of the analysis can
be run in parallel on different processors. By default, the program assumes you
have one processor, so the analysis is performed using one thread. You can use -PROCessors
to increase the number of threads up to the number of physical processors on
the computer.

Under
ideal conditions, the increase in speed is roughly linear with the number of
processors used. But conditions are rarely ideal. If your computer is heavily
used, competition for the processors can reduce the program's performance. In
such an environment, try to run multithreaded programs during times when the
load on the system is light.

As
the number of threads increases, the amount of memory required increases
substantially. You may need to ask your system administrator to increase the
memory quota for your account if you want to use more than two threads.

Never
use -PROCessors to set the number of threads higher
than the number of physical processors that the machine has -- it does not
increase program performance, but instead uses up a lot of memory needlessly
and makes it harder for other users on the system to get processor time. Ask
your system administrator how many processors your computer has if you aren't
sure.

SUGGESTIONS

If you want to search a single database division instead of an entire database,
see the "Using Database Sequences" topic of Section 2, Using Sequence
Files and Databases of the User's Guide for a list of the logical names used
for the databases and the divisions of each database. The search set can also
consist of a group of sequence files that are not in a database. Use a multiple
sequence specification to name these. For information about naming groups of
sequences for the search set, see the topics "Specifying Files" and
"Using Wildcards" in Section 1, Getting Started, and "Using
Database Sequences," "Using Multiple Sequence Format (MSF)
Files", "Using Rich Sequence Format (RSF) Files", and
"Using List Files" in Section 2, Using Sequence Files and Databases
of the User's Guide.

Batch
Queue

FastA is one of the few programs in GCG that can take more than a few
minutes to run. Most comparisons should probably be run in the batch queue. You
can specify that this program run at a later time in the batch queue by using -BATch.
Run this way, the program prompts you for all the required parameters and then
automatically submits itself to the batch or at queue. For more information,
see "Using
the Batch Queue" in Section 3, Using Programs in the User's Guide.
Very large comparisons may exceed the CPU limit set by some systems.

Interrupting
a Search: <Ctrl>C

You can type <Ctrl>C to interrupt a search and see the results from the
part of the search that has already been completed. Because the program is
multithreaded, the search may not be interrupted immediately, but will continue
until one of the threads finishes processing its data and returns for more
data.

COMMAND-LINE
SUMMARY

All parameters for this program
may be added to the command line. Use
-CHEck to view
the summary below and to specify parameters before the program executes. In the
summary below, the capitalized letters in the parameter names are the letters
that you must type in order to use the parameter. Square brackets ([ and
]) enclose parameter values that are optional.

Minimal Syntax: % fasta [-INfile1=]ggamma.pep -Default

Prompted Parameters:

[-INfile2=]pir:* specifies the search set

[-OUTfile=]ggamma.fasta specifies the output file name

-BEGin=1 -END=148 sets the range of interest

-WORdsize=2 sets the word size

-EXPect=2.0 lists scores until E() value reaches 2.0

Local Data Files:

-MATRix=fastadna.cmp assigns the scoring matrix for nucleic acids

-MATRix=blosum50.cmp assigns the scoring matrix for proteins

Optional Parameters:

-PROCessors=2 sets the number of threads devoted to the analysis

on a multiprocessor computer

-MINLength=1000 searches only sequences of 1000 or more residues

-MAXLength=5000 searches only sequences of 5000 or fewer residues

-SINce=6.90 limits search to sequences dated on or after June 1990

-ONEstrand searches using only the top strand of nucleotide queries

-PAMfactor uses scoring matrix to calculate initial diagonal scores

-GAPweight=16 sets gap creation penalty (12 is protein default)

-LENgthweight=4 sets gap extension penalty (2 is protein default)

-OPTall=20 computes opt score when the initn score is 20

or higher; sorts on opt score

-NOOPTall doesn't compute opt score during search; sorts on initn

-SWalign creates final alignment as unlimited Smith-Waterman for nuc

-LIStsize=40 shows the best 40 scores (overrides EXPect)

-ALIgn=20 shows the best 20 alignments

-NOALIgn suppresses sequence alignments

-SHOWall shows complete sequences in alignment, not just overlaps

-MARKx=3 sets the alignment display mode

-NOHIStogram suppresses printing the histogram

-LINesize=60 sets number of sequence symbols per line of the alignment

-NODOCLines suppresses sequence documentation in the alignment

-BATch submits the program to run in the batch queue

-NOMONitor suppresses the screen trace for each search set sequence

LOCAL DATA FILES

The files described below
supply auxiliary data to this program. The program automatically reads them from
a public data directory unless you either 1) have a data file with exactly the
same name in your current working directory; or 2) name a file on the command
line with an expression like -DATa1=myfile.dat. For more information
see Section 4, Using Data Files in the User's Guide.

Local
Scoring Matrices

This program reads one or more scoring matrices for the comparison of sequence
characters. The program automatically reads the program's default scoring
matrix in a public data directory unless you either 1) have a data file with
exactly the same name as the program default scoring matrix in your current
working directory; or 2) have a data file with exactly the same name as the
program default scoring matrix in the directory with the logical name MyData;
or 3) name a file on the command line with an expression like -MATRix=mymatrix.cmp.
If you don't include a directory specification when you name a file with -MATRix,
the program searches for the file first in your local directory, then in the directory
with the logical name MyData, then in the public data directory with the
logical name GenMoreData, and finally in the public data directory with the
logical name GenRunData. For more information see "Using a Special Kind of
Data File: A Scoring Matrix" in Section 4, Using Data Files in the User's
Guide.

FastA
reads a scoring matrix containing the values for every possible match from your
working directory or the public database. The files fastadna.cmp (for nucleic
acid sequences) and blosum50.cmp (for protein sequences) contain the default
values for matches. blosum50.cmp is a BLOSUM50 matrix. You can use the Fetch program to obtain a copy of these files in order to
modify them to suit your own needs.

PARAMETER REFERENCE

Allows you to specify a scoring matrix file name other than the program
default. If you don't include a directory specification when you name a file
with -MATRix, the program searches for the file
first in your local directory, then in the directory with the logical name
MyData, then in the public data directory with the logical name GenMoreData,
and finally in the public data directory with the logical name GenRunData.

For
more information see the Local Scoring Matrices section.

-EXPect=2.0

Shows all scores whose E() value is less than 2.0. Ignored if -LIStsize
is used.

-PROCessors=2

Tells the program to use 2 threads for the database search on a multiprocessor
computer.

-MINLength=1000

Restricts the search to search set sequences that are equal to or longer than
1000 residues.

-MAXLength=5000

Restricts the search to search set sequences that are equal to or shorter than
5000 residues.

-SINce=6.1990

Limits the search to sequences that have been entered into the database or modified
since June 1990. As this is being written, only the EMBL, GenBank, and
SWISS-PROT databases support this parameter.

-ONEstrand

Searches using only the top strand of a nucleotide query sequence.

-PAMfactor

Uses a scoring matrix for the calculation of initial diagonal scores. Instead
of using a constant factor for each match in a word, values are obtained from a
scoring matrix. This is the default option for protein sequences, while -NOPAMfactor is the default for nucleic acid sequences.

-GAPweight=12

Specifies the gap creation penalty that is subtracted from the alignment score
whenever a gap is created.

-LENgthweight=2

Specifies the gap extension penalty that is subtracted from the alignment score
for each residue added to an existing gap.

-OPTall=20

Immediately performs an alignment and calculates the opt score when the initn
score is greater than or equal to 20. This parameter allows you to override the
default threshold calculated by the program. Scores are sorted and saved by opt
score during the search. -NOOPTall doesn't compute the opt score until
the search is complete. In this case scores are sorted and saved by initn score
instead of by opt score.

-SWalign

Does an unlimited Smith-Waterman alignment as the final alignment for nucleotide
searches, instead of the "alignment in a band" version of
Smith-Waterman. (Note: this can be very slow.)

-LIStsize=40

Shows the best 40 scores. Overrides -EXPect.

-ALIgn=10

Limits the number of alignments to display in the output file to the 10 best
matches in the list. Use the -NOALIgn to suppress the sequence alignments in
the output file.

-SHOWall

Shows entire sequences in the alignment display, instead of just the best
region of overlap and its surroundings.

-MARKx=3

Determines the alignment display mode -- especially the symbols that identify
matches and mismatches. The default value, 3, uses a pipe character (|) to show
identities and a colon (:) to show conservative replacements. -MARKx=0
uses a colon to show identities and a period (.) to show conservative
replacements. -MARKx=1 will not mark identities;
instead, conservative replacements are connected with a lowercase x, and
non-conservative substitutions are connected with an uppercase X. If -MARKx=2,
the residues in the second sequence are shown only if they differ from the
first sequence.

Use -MARKx=10
to get aligned sequences in the FastA "parsable" output format. A
document describing this format appears after FastA in the Program Manual.

-NOHIStogram

Suppresses printing the histogram.

-LINesize=60

Lets you set the number of sequence symbols in each line of the alignment to
any number between 60 and 200.

-NODOCLines

Suppresses the documentation from the search set sequence accompanying the
alignment in the output file. Use -DOCLines=5 to copy only five non-blank
lines of documentation.

-BATch

Submits the program to the batch queue for processing after prompting you for
all required user inputs. Any information that would normally appear on the
screen while the program is running is written into a log file. Whether that
log file is deleted, printed, or saved to your current directory depends on how
your system manager has set up the command that submits this program to the
batch queue. All output files are written to your current directory, unless you
direct the output to another directory when you specify the output file.

-MONitor=500

Monitors this program's progress on your screen. Use this parameter to see this
same monitor in the log file for a batch process. If the monitor is slowing
down the program because your terminal is connected to a slow modem, suppress
it with -NOMONitor.

The
monitor is updated every time the program processes 500 sequences or files. You
can use a value after the parameter to set this monitoring interval to some
other number.

All other product names mentioned in this
documentation may be trademarks, and if so, are trademarks or registered
trademarks of their respective holders and are used in this documentation for
identification purposes only.