Function

Report on signal cleavage sites in a protein sequence

Description

sigcleave predicts the site of cleavage between a signal sequence and the mature exported protein using the method of von Heijne. It reads one or more protein sequences and writes a standard EMBOSS report with the position, length and score of each predicted signal sequence. Optionally, you can specify that the sequence is prokaryotic, which changes the default scoring data file. The predictive accuracy is estimated to be around 75-80% for both prokaryotic and eukaryotic proteins.

Algorithm

sigcleave uses the method of von Heijne as modified in his later book, where the treatment of positions -1 and -3 in the matrix is slightly altered (see references). The -minweight qualifier sets the minimum scoring weight that a predicted cleavage site must reach. The value of -minweight should be at least 3.5. At this level, the method should correctly identify 95% of signal peptides and reject 95% of non-signal peptides. The cleavage site should be correctly predicted in 75-80% of cases.
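To illustrate how a weight-matrix scan with a -minweight style threshold can work, here is a minimal Python sketch. It is not the sigcleave source code. The window of 13 residues before and 2 residues after the cleavage site is an assumption, the function name scan_cleavage_sites is hypothetical, and the toy weights are invented for illustration; they are not the values in the distributed scoring data files.

```python
from typing import Dict, List, Tuple

# Assumed window: positions -13..-1 before the cleavage site, +1..+2 after it.
N_BEFORE = 13
N_AFTER = 2


def scan_cleavage_sites(
    sequence: str,
    weights: List[Dict[str, float]],
    minweight: float = 3.5,
    default_weight: float = 0.0,
) -> List[Tuple[int, float]]:
    """Score every window and keep those whose summed weight reaches minweight.

    weights[i] maps a residue letter to its weight at window offset i.
    Returns (1-based position of the first mature residue, score),
    highest score first.
    """
    window = N_BEFORE + N_AFTER
    if len(weights) != window:
        raise ValueError(f"expected {window} weight columns, got {len(weights)}")
    seq = sequence.upper()
    hits: List[Tuple[int, float]] = []
    for start in range(len(seq) - window + 1):
        score = sum(
            weights[i].get(seq[start + i], default_weight) for i in range(window)
        )
        if score >= minweight:
            # The mature protein starts right after the N_BEFORE residues.
            hits.append((start + N_BEFORE + 1, score))
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits


# Toy matrix (invented values): only positions -3 and -1 carry weight.
toy_weights: List[Dict[str, float]] = [{} for _ in range(N_BEFORE + N_AFTER)]
toy_weights[N_BEFORE - 3] = {"A": 2.0, "V": 1.5}  # position -3
toy_weights[N_BEFORE - 1] = {"A": 2.5, "G": 1.5}  # position -1

if __name__ == "__main__":
    for position, score in scan_cleavage_sites("MKLLLLLLLLAQADEK", toy_weights):
        print(f"mature protein starts at residue {position}, score {score}")
```

Raising minweight in this sketch removes lower-scoring windows from the result, which mirrors the trade-off between sensitivity and specificity described above.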

Specifies the sequence is prokaryotic and changes the default scoring data file name

Boolean value Yes/No

No

Advanced (Unprompted) qualifiers

(none)

Associated qualifiers

"-sequence" associated seqall qualifiers

Qualifier                             Type     Description                        Allowed values        Default
-sbegin1, -sbegin_sequence            integer  Start of each sequence to be used  Any integer value     0
-send1, -send_sequence                integer  End of each sequence to be used    Any integer value     0
-sreverse1, -sreverse_sequence        boolean  Reverse (if DNA)                   Boolean value Yes/No  N
-sask1, -sask_sequence                boolean  Ask for begin/end/reverse          Boolean value Yes/No  N
-snucleotide1, -snucleotide_sequence  boolean  Sequence is nucleotide             Boolean value Yes/No  N
-sprotein1, -sprotein_sequence        boolean  Sequence is protein                Boolean value Yes/No  N
-slower1, -slower_sequence            boolean  Make lower case                    Boolean value Yes/No  N
-supper1, -supper_sequence            boolean  Make upper case                    Boolean value Yes/No  N
-scircular1, -scircular_sequence      boolean  Sequence is circular               Boolean value Yes/No  N
-squick1, -squick_sequence            boolean  Read id and sequence only          Boolean value Yes/No  N
-sformat1, -sformat_sequence          string   Input sequence format              Any string
-iquery1, -iquery_sequence            string   Input query fields or ID list      Any string
-ioffset1, -ioffset_sequence          integer  Input start position offset        Any integer value     0
-sdbname1, -sdbname_sequence          string   Database name                      Any string
-sid1, -sid_sequence                  string   Entryname                          Any string
-ufo1, -ufo_sequence                  string   UFO features                       Any string
-fformat1, -fformat_sequence          string   Features format                    Any string
-fopenfile1, -fopenfile_sequence      string   Features file name                 Any string

"-outfile" associated report qualifiers

Qualifier                             Type     Description                               Allowed values        Default
-rformat2, -rformat_outfile           string   Report format                             Any string            motif
-rname2, -rname_outfile               string   Base file name                            Any string
-rextension2, -rextension_outfile     string   File name extension                       Any string            sig
-rdirectory2, -rdirectory_outfile     string   Output directory                          Any string
-raccshow2, -raccshow_outfile         boolean  Show accession number in the report       Boolean value Yes/No  N
-rdesshow2, -rdesshow_outfile         boolean  Show description in the report            Boolean value Yes/No  N
-rscoreshow2, -rscoreshow_outfile     boolean  Show the score in the report              Boolean value Yes/No  Y
-rstrandshow2, -rstrandshow_outfile   boolean  Show the nucleotide strand in the report  Boolean value Yes/No  Y
-rusashow2, -rusashow_outfile         boolean  Show the full USA in the report           Boolean value Yes/No  N
-rmaxall2, -rmaxall_outfile           integer  Maximum total hits to report              Any integer value     0
-rmaxseq2, -rmaxseq_outfile           integer  Maximum hits to report for one sequence   Any integer value     0

General qualifiers

Qualifier  Type     Description                                 Allowed values        Default
-auto      boolean  Turn off prompts                            Boolean value Yes/No  N
-stdout    boolean  Write first file to standard output         Boolean value Yes/No  N
-filter    boolean  Read first file from standard input, write  Boolean value Yes/No  N
                    first file to standard output
-options   boolean  Prompt for standard and additional values   Boolean value Yes/No  N
-debug     boolean  Write debug output to program.dbg           Boolean value Yes/No  N
-verbose   boolean  Report some/full command line options       Boolean value Yes/No  Y
-help      boolean  Report command line options and exit. More  Boolean value Yes/No  N
                    information on associated and general
                    qualifiers can be found with -help -verbose
-warning   boolean  Report warnings                             Boolean value Yes/No  Y
-error     boolean  Report errors                               Boolean value Yes/No  Y
-fatal     boolean  Report fatal errors                         Boolean value Yes/No  Y
-die       boolean  Report dying program messages               Boolean value Yes/No  Y
-version   boolean  Report version number and exit              Boolean value Yes/No  N

Input file format

sigcleave reads one or more protein sequences.

The input is a standard EMBOSS sequence query (also known as a 'USA').

Major sequence database sources defined as standard in EMBOSS
installations include srs:embl, srs:uniprot and ensembl.

Data can also be read from sequence output in any supported format
written by an EMBOSS or third-party application.

The input format can be specified by using the
command-line qualifier -sformat xxx, where 'xxx' is replaced
by the name of the required format. The available format names are:
gff (gff3), gff2, embl (em), genbank (gb, refseq), ddbj, refseqp, pir
(nbrf), swissprot (swiss, sw), dasgff and debug.

Output file format

The output is a standard EMBOSS report file.

The results can be output in one of several styles by using the
command-line qualifier -rformat xxx, where 'xxx' is replaced
by the name of the required format. The available format names are:
embl, genbank, gff, pir, swiss, dasgff, debug, listfile, dbmotif,
diffseq, draw, restrict, excel, feattable, motif, nametable, regions,
seqtable, simple, srs, table, tagseq.

Data files

EMBOSS data files are distributed with the application and stored
in the standard EMBOSS data directory, which is defined
by the EMBOSS environment variable EMBOSS_DATA.

To see the available EMBOSS data files, run:

% embossdata -showall

To fetch one of the data files (for example 'Exxx.dat') into your
current directory for you to inspect or modify, run:

% embossdata -fetch -file Exxx.dat

Users can provide their own data files in their own directories.
Project specific files can be put in the current directory, or for
tidier directory listings in a subdirectory called
".embossdata". Files for all EMBOSS runs can be put in the user's home
directory, or again in a subdirectory called ".embossdata".
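The locations listed above can be summarised as a lookup sketch in Python. The text names the places a data file may live but does not state the order in which they are searched, so the order used here (current directory, its ".embossdata" subdirectory, home directory, its ".embossdata" subdirectory, then EMBOSS_DATA) is an assumption. The function names and the example paths are hypothetical.

```python
import os
from typing import List, Optional


def candidate_data_paths(
    filename: str, cwd: str, home: str, emboss_data: Optional[str] = None
) -> List[str]:
    """List the places a data file might be found, in an ASSUMED search order."""
    directories = [
        cwd,                                # project-specific file
        os.path.join(cwd, ".embossdata"),   # tidier project-specific location
        home,                               # file for all EMBOSS runs
        os.path.join(home, ".embossdata"),  # tidier per-user location
    ]
    if emboss_data:
        directories.append(emboss_data)     # standard EMBOSS data directory
    return [os.path.join(directory, filename) for directory in directories]


def find_data_file(
    filename: str, cwd: str, home: str, emboss_data: Optional[str] = None
) -> Optional[str]:
    """Return the first candidate path that exists, or None if there is none."""
    for path in candidate_data_paths(filename, cwd, home, emboss_data):
        if os.path.isfile(path):
            return path
    return None


if __name__ == "__main__":
    paths = candidate_data_paths(
        "Exxx.dat",
        cwd=os.getcwd(),
        home=os.path.expanduser("~"),
        emboss_data=os.environ.get("EMBOSS_DATA"),
    )
    for path in paths:
        print(path)
```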

Notes

sigcleave may predict any number of cleavage sites in a protein sequence but not all of these will be biologically relevant; the prediction algorithm is not perfect. There is no cutoff to eliminate sites because it is down to human expertise to decide what is relevant or not. Although the end of a protein sequence is usually easy to predict from a nucleotide sequence, the same cannot be said for the start which depends on such things as promoters, transcriptional control and splicing. This is why all predicted cleavage sites are reported.

It is often useful to restrict the analysis to the start of the input sequence using the built-in qualifier -send. For example, adding -send 50 to the command line checks only the first 50 residues.
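When sigcleave is driven from a script, the -send restriction can be added to the argument list alongside the other qualifiers. The following Python sketch only assembles the command line from qualifiers named in this document (-sequence, -outfile, -minweight, -send, -auto). The function names and the file names protein.fasta and protein.sig are hypothetical, and the run helper does nothing if no sigcleave executable is found on the PATH.

```python
import shutil
import subprocess
from typing import List, Optional


def build_sigcleave_command(
    sequence: str,
    outfile: str,
    minweight: float = 3.5,
    send: Optional[int] = None,
) -> List[str]:
    """Assemble a sigcleave argument list; nothing is executed here."""
    command = [
        "sigcleave",
        "-sequence", sequence,
        "-outfile", outfile,
        "-minweight", str(minweight),
        "-auto",  # turn off prompts for unattended use
    ]
    if send is not None:
        # Only residues up to this position are examined.
        command += ["-send", str(send)]
    return command


def run_if_available(command: List[str]) -> Optional[subprocess.CompletedProcess]:
    """Run the command only when the executable is on the PATH."""
    if shutil.which(command[0]) is None:
        return None
    return subprocess.run(command, check=True)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    command = build_sigcleave_command("protein.fasta", "protein.sig", send=50)
    print(" ".join(command))
```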