Function

Remove non-alphabetic (e.g. gap) characters from sequences

Description

degapseq reads one or more sequences and writes them out again with all non-alphabetic characters removed. Its main purpose is to remove gap characters from aligned sequences, but it also removes other symbols, such as the translation STOP symbol ('*') in a protein sequence.
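
The behaviour described above can be illustrated with a short Python sketch. This is not the EMBOSS implementation; the function name strip_non_alphabetic is hypothetical, and the sketch assumes that "alphabetic" means the ASCII letters A-Z and a-z.

```python
import string

# Assumption: "alphabetic" means ASCII letters only (A-Z, a-z).
_LETTERS = frozenset(string.ascii_letters)


def strip_non_alphabetic(sequence: str) -> str:
    """Return the sequence with every non-letter character removed."""
    return "".join(ch for ch in sequence if ch in _LETTERS)


if __name__ == "__main__":
    # Gap characters in a nucleotide sequence are dropped.
    print(strip_non_alphabetic("ACG--T.A~CGT"))  # ACGTACGT
    # The STOP symbol in a protein sequence is dropped as well.
    print(strip_non_alphabetic("MKV-LA*"))       # MKVLA
```

Filtering on letters rather than on a list of known gap characters is what makes the '*' symbol disappear along with the gaps.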

Qualifier   Type     Description                                       Allowed values        Default
                     Read first file from standard input, write first  Boolean value Yes/No  N
                     file to standard output
-options    boolean  Prompt for standard and additional values         Boolean value Yes/No  N
-debug      boolean  Write debug output to program.dbg                 Boolean value Yes/No  N
-verbose    boolean  Report some/full command line options             Boolean value Yes/No  Y
-help       boolean  Report command line options and exit. More        Boolean value Yes/No  N
                     information on associated and general qualifiers
                     can be found with -help -verbose
-warning    boolean  Report warnings                                   Boolean value Yes/No  Y
-error      boolean  Report errors                                     Boolean value Yes/No  Y
-fatal      boolean  Report fatal errors                               Boolean value Yes/No  Y
-die        boolean  Report dying program messages                     Boolean value Yes/No  Y
-version    boolean  Report version number and exit                    Boolean value Yes/No  N

Input file format

degapseq reads one or more nucleotide or protein sequences.

The input is a standard EMBOSS sequence query (also known as a 'USA').

Major sequence database sources defined as standard in EMBOSS
installations include srs:embl, srs:uniprot and ensembl.

Data can also be read from sequence output in any supported format
written by an EMBOSS or third-party application.

The input format can be specified by using the
command-line qualifier -sformat xxx, where 'xxx' is replaced
by the name of the required format. The available format names are:
gff (gff3), gff2, embl (em), genbank (gb, refseq), ddbj, refseqp, pir
(nbrf), swissprot (swiss, sw), dasgff and debug.

Output files for usage example

File: nogaps.seq

Data files

None.

Notes

There are many different formats for storing molecular sequences in files. Some formats are specifically for aligned sequences, where gaps are inserted into the sequences for purposes of alignment. Gaps are indicated with different characters depending on the format in question, but commonly include '.', '-' and '~'. Some formats use more than one character to indicate different types of gaps; for example, gaps at the sequence ends, internal gaps, gaps inserted by a program and gaps inserted manually by a person editing the alignment may all be denoted with different characters.

EMBOSS uses only the dash character ('-') to indicate gaps. When an EMBOSS program reads a sequence with gaps, all gap characters are changed internally to a dash ('-'). Thus any characters that distinguish different gap types are converted to a '-' on output.
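
The effect of this normalisation can be sketched in Python. This is an illustration only, not EMBOSS code: the function name normalize_gaps is hypothetical, and the set of input gap characters is assumed to be just the '.' and '~' mentioned above.

```python
# Assumption: '.' and '~' are the only non-dash gap characters in the input.
_GAP_TABLE = str.maketrans({".": "-", "~": "-"})


def normalize_gaps(sequence: str) -> str:
    """Map every recognised gap character to a dash, keeping sequence length."""
    return sequence.translate(_GAP_TABLE)


if __name__ == "__main__":
    # End gaps ('~') and internal gaps ('.') become indistinguishable.
    print(normalize_gaps("~~ACG..T-A"))  # --ACG--T-A
```

After this step the information about which kind of gap was present is lost, which is why the distinction cannot be recovered from the output.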

References

None.

Warnings

degapseq removes '*' characters from protein sequences as well as
gap characters.