Information about STEPdb server

Search:

In STEP, search is performed locally on the displayed proteome or sub-proteome, so it covers only the current table.
(e.g. a user browsing the Complexome can search only for entries within this group of proteins.)

Search is currently basic. Logical operators such as OR and AND are not supported
by the search tool, and the query text is matched intact.
(e.g. words separated by a space are not searched separately)
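
The behaviour described above can be sketched as a plain substring match over the rows of the current table. This is only an illustration, not the server's implementation: the function name, the row layout and the case-insensitive comparison are assumptions, and the example rows are invented.

```python
def search_table(rows, query):
    """Return the rows of the current table that contain the query verbatim.

    The query is treated as one intact string: it is not split on spaces
    and words such as OR / AND have no special meaning.
    Case-insensitive matching is an assumption of this sketch.
    """
    needle = query.strip().lower()
    if not needle:
        return []
    return [
        row for row in rows
        if any(needle in str(value).lower() for value in row.values())
    ]


# Invented rows standing in for the table currently displayed.
complexome = [
    {"name": "SecYEG translocon", "note": "inner membrane complex"},
    {"name": "ATP synthase", "note": "inner membrane complex"},
]

hits = search_table(complexome, "ATP synthase")      # matches one row
no_hits = search_table(complexome, "synthase ATP")   # word order matters
```

Because the text is matched intact, a query such as "ATP OR SecYEG" is looked up literally and finds nothing in this example.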

Glossary:

Basic Proteome of E.coli K-12 (MG1655)

STEP defines the "basic" proteome of E.coli K-12 (MG1655) as the fraction of the proteome that is devoid of gene products that are
likely not produced, or that result from genomic insertions enriched in defective prophages, transposons, pseudogenes, integrases and
mobile elements. The "basic" proteome was defined by manual annotation, based on EcoGene (Rudd, 2000), Uniprot (Dimmer et al, 2012)
and other studies (McClelland et al, 2001; Ochman et al, 2000). F1 plasmid-encoded proteins
were also removed from the Uniprot data of the E.coli K-12 proteome used (release version of November 2010).
According to our analysis, the essential protein-coding sequences devoid of these elements encompass only 3899 proteins.

Protein structure is encoded in the primary amino acid sequence. However, the folding of
a protein to its functional form has, in some cases, to overcome misfolded states that can lead to protein
inclusion bodies.
Niwa et al (2009) calculated protein solubility
for 3173 Escherichia coli proteins in a chaperone-free reconstituted translation system.
The aggregation propensity of each protein was examined by a centrifugation assay. Solubility is defined as the index
of aggregation propensity, expressed as the proportion of the supernatant fraction obtained after
centrifugation of a translation mixture to the uncentrifuged total protein. Solubility is therefore a percentage
and ranges from 0% to 100%.
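
As a minimal sketch of this definition, solubility can be computed as the supernatant amount divided by the uncentrifuged total, times 100. The function and argument names are hypothetical, and the input checks are an assumption added for the example.

```python
def solubility_percent(supernatant, total):
    """Solubility as the supernatant fraction of the uncentrifuged total, in %.

    Both arguments are protein amounts in the same (arbitrary) unit.
    """
    if total <= 0:
        raise ValueError("total protein must be positive")
    if not 0 <= supernatant <= total:
        raise ValueError("supernatant must lie between 0 and total")
    return 100.0 * supernatant / total


fully_aggregated = solubility_percent(0.0, 80.0)   # nothing left in supernatant
quarter_soluble = solubility_percent(20.0, 80.0)
fully_soluble = solubility_percent(80.0, 80.0)
```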

Manual Curation and non-experimental qualifiers

We use the same non-experimental qualifiers as Uniprot:

Potential:
There is some logical or conclusive evidence that the given annotation could apply.
This non-experimental qualifier is often used to present results from protein sequence analysis software
tools, which are only annotated if the result makes sense in the biological context of a given protein.

Probable:
Indicates stronger evidence than the qualifier "Potential". This qualifier implies
that there must be at least some experimental evidence indicating that the information is expected
to be found in the natural environment of a protein.

By similarity:
When some biological information was experimentally obtained for a given protein
(or part of it), it may be transferred to other protein family members within a certain taxonomic range,
dependent on the biological event or characteristic.

Sub-cellular localization special characters and formalisms

To denote multiple localization possibilities that have been experimentally established we introduced the comma "," formalism
whereas a slash "/" denotes two or more possible sub-cellular locations that have not yet been experimentally determined.
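
The two separators can be read programmatically as sketched below. The exact spelling of localization strings in STEPdb is not given here, so the example inputs, the function name and the three status labels are assumptions made for illustration.

```python
def parse_localization(annotation):
    """Split a localization annotation and report what the separator means.

    ","  -> several localizations, experimentally established
    "/"  -> several possible localizations, not yet determined experimentally
    Mixing both separators in one string is rejected in this sketch.
    """
    if "," in annotation and "/" in annotation:
        raise ValueError("mixed separators are not handled in this sketch")
    if "/" in annotation:
        return [part.strip() for part in annotation.split("/")], "undetermined"
    if "," in annotation:
        return [part.strip() for part in annotation.split(",")], "established"
    return [annotation.strip()], "single"


established = parse_localization("A,F1")
undetermined = parse_localization("B/G")
single = parse_localization("G")
```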

We define as exportome those proteins that are localized within the inner membrane and beyond (e.g. lipoproteins, extra-cellular proteins).
This includes the STEPdb sub-cellular classes: B, I, E, F2, F3, G, X, F4.
We divide the exportome into two subclasses, the membranome and the secretome. The membranome contains proteins that are embedded in the inner membrane,
whereas the secretome refers to proteins that are fully translocated across the inner membrane.
The membranome and secretome can be further divided into proteins that are substrates of the Sec and Tat secretion pathways. Finally, the secretome includes a
particular class of non-classical secretory proteins, which are secreted without an apparent signal peptide motif.

Topology and orientation of IM proteins

Transmembrane regions of integral membrane proteins were predicted using Phobius.
In cases where Phobius failed to identify any transmembrane region, the prediction of
TMHMM was used instead. The predicted orientation of the polypeptide sequences,
which corresponds to the location of the C-terminus (cytoplasmic or periplasmic), was reconsidered based on experimental
verification of the C-terminus location for 734 transmembrane proteins (Daley et al, 2005).
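
The selection rule above can be sketched as follows. The sketch takes already-parsed predictions as input and does not run Phobius or TMHMM; the function names and data layout are hypothetical, and letting an experimental C-terminus location override the predicted one is an assumption about how the reconsideration was applied.

```python
def choose_tm_prediction(phobius_regions, tmhmm_regions):
    """Prefer Phobius; fall back to TMHMM only if Phobius found no TM region.

    Each argument is a list of (start, end) residue positions.
    """
    if phobius_regions:
        return "Phobius", phobius_regions
    return "TMHMM", tmhmm_regions


def c_terminus_location(predicted, experimental=None):
    """Use the experimentally verified C-terminus location when one exists."""
    return experimental if experimental is not None else predicted


# Invented coordinates, for illustration only.
source_a, regions_a = choose_tm_prediction([(12, 34), (50, 72)], [(11, 33)])
source_b, regions_b = choose_tm_prediction([], [(8, 30)])
location = c_terminus_location("cytoplasmic", experimental="periplasmic")
```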

Internal Connections

There are internal connections between some of the tables. Proteins listed in the E.coli K-12 proteome
table link to the list of complexes in which they participate. Furthermore, from the Complexome table the user can
view the schematic representation of each complex and, through this schematic, link directly to the K-12 table.

Color Code

STEPdb follows a specific color code to represent proteins in the various sub-cellular locations.
The E.coli K-12 export systems and the Peripherome are drawn as cartoons in which each
protein is represented as a filled circle following the color code below. Additionally, in the "Complexome" page,
each complex is drawn dynamically upon clicking the "draw" button. The protein subunits of each complex also follow STEPdb's color code.

Protein Symbol

Protein Localization

Nucleoid (N)
Cytoplasmic (A)
Ribosomal (r)
Peripherally associated with the plasma membrane facing the cytoplasm (F1)
Inner membrane protein (B)
Peripherally associated with the plasma membrane facing the periplasm (F2)
Inner membrane lipoprotein (E)
Periplasmic (G)
Outer membrane lipoprotein (I)
Peripherally associated with the outer membrane facing the periplasm (F3)

Multifun Terms

Peripheral inner membrane proteins were classified into eight major categories of cellular function, mainly based on MultiFun terms
(Serres & Riley, 2000). These are summarized in the table below.

Cellular Process    MultiFun term                       GO term                   GO id
Metabolism          MultiFun:1 Metabolism               GO:metabolism             GO:0008152
DNA-related         MultiFun:2.1 DNA related            GO:DNA metabolism         GO:0006259
RNA-related         MultiFun:2.2 RNA related            GO:RNA metabolism         GO:0016070
Protein-related     MultiFun:2.3 Protein related        GO:protein biosynthesis   GO:0006412
Transport           MultiFun:4 Transport                GO:transport              GO:0006810
Cell division       MultiFun:5.1 Cell division          GO:cytokinesis            GO:0000910
Response to stress  MultiFun:5.5 Adaptation to stress   GO:response to stress     GO:0006950
Cell structure      MultiFun:6 Cell structure           GO:cellular_component     GO:0005575

MatureP classifier

Methods

The MatureP classifier distinguishes Sec secretory proteins from cytoplasmic ones.
Two methods are provided:
1. MatureP, a classifier that accepts only the mature sequences of potential secretory or
cytoplasmic proteins (i.e. known or potential signal peptide sequences must be removed).
2. SP-MatureP, a combinatorial classifier that takes into account both the MatureP and the pre-protein classifiers.
In the SP-MatureP method, the pre-protein classifier first predicts the existence of a signal peptide sequence and then the
MatureP classifier tests the validity of the mature sequence. SP-MatureP decides whether a sequence is “cytoplasmic”,
a mature or a secretory pre-protein sequence or, more interestingly, a “non-secretory” sequence (i.e. possessing
a signal peptide but having a non-compatible mature sequence).
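
One possible reading of the SP-MatureP decision is sketched below as a lookup on the two classifier outcomes. The boolean inputs stand in for the outputs of the pre-protein and MatureP classifiers, and the mapping of the "no signal peptide, secretory-like mature sequence" case to a mature sequence is an interpretation of the text, not a documented rule.

```python
def sp_maturep_label(has_signal_peptide, mature_is_secretory):
    """Combine the two classifier outcomes into one label.

    has_signal_peptide  : outcome of the pre-protein classifier
    mature_is_secretory : outcome of MatureP on the mature sequence
    """
    if has_signal_peptide:
        if mature_is_secretory:
            return "secretory pre-protein"
        # Signal peptide present, but the mature domain is not compatible.
        return "non-secretory"
    if mature_is_secretory:
        return "mature"
    return "cytoplasmic"


labels = {
    (sp, mature): sp_maturep_label(sp, mature)
    for sp in (True, False)
    for mature in (True, False)
}
```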

MatureP score threshold

MatureP is a linear classifier that uses a variety of features derived from the amino acid sequence, such as amino acids,
di-peptides and tri-peptides or pairwise interaction energy. MatureP assigns a classification score to each provided
sequence. The final decision of the classifier depends on the selected score threshold, above which proteins are considered
to be secretory. The most commonly used threshold is zero, in which case positively scored sequences are predicted as
secretory. A different score threshold can also be chosen. Using the scores of the training samples, we can draw the hit rate curves (y-axis)
versus score (x-axis) for both the positive and the negative classes (press button below to draw the hit rate distributions of MatureP). The hit rate is the percentage of correctly predicted positive or negative
samples when a given score is used as the classification threshold. As the hit rate of the positive class increases, the hit rate of
the negative class decreases. For every classifier there exists a score threshold at which the two hit rates are equal.
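
The hit rates and the threshold at which they meet can be sketched as below. The scores are invented toy values, not MatureP output, and the function names are hypothetical. On a finite sample the two rates may not be exactly equal at any score, so the sketch returns the candidate threshold where they are closest.

```python
def hit_rates(pos_scores, neg_scores, threshold):
    """Fraction of correctly predicted positives and negatives at a threshold.

    A sequence is predicted secretory when its score is above the threshold.
    """
    pos_hit = sum(s > threshold for s in pos_scores) / len(pos_scores)
    neg_hit = sum(s <= threshold for s in neg_scores) / len(neg_scores)
    return pos_hit, neg_hit


def balanced_threshold(pos_scores, neg_scores):
    """Observed score at which the two hit rates are closest to each other."""
    def gap(threshold):
        pos_hit, neg_hit = hit_rates(pos_scores, neg_scores, threshold)
        return abs(pos_hit - neg_hit)

    candidates = sorted(set(pos_scores) | set(neg_scores))
    return min(candidates, key=gap)


# Invented training scores for the positive (secretory) and negative classes.
positives = [2.0, 1.5, 0.5, -0.5]
negatives = [-2.0, -1.0, -0.2, 0.8]

at_zero = hit_rates(positives, negatives, 0.0)
best = balanced_threshold(positives, negatives)
```

Raising the threshold lowers the positive hit rate and raises the negative one, which is why the two curves cross.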

Escherichia coli
505 Sec-dependent secretory and 2365 cytoplasmic sequences of the Escherichia coli K-12 proteome (STEPdb) were used
for the machine learning analysis. The class of secretory proteins includes eight sub-cellular categories of STEPdb (see table below).
Only proteins that use the Sec secretion system for their translocation from the cytoplasm to the periplasm were included.
39 proteins with a Tat signal peptide or using the flagellar Type III system were excluded.
The cleavage sites of the type I signal peptides (e.g. periplasmic proteins excluding lipoproteins) were predicted using
SignalP 4.0 and Phobius.
The cleavage sites of the type II signal peptides (i.e. inner and outer membrane lipoproteins) were predicted with
LipoP.

Sub-cellular Location                                               STEPdb nomenclature   # Proteins

Sec secretory proteins
Peripheral inner membrane protein facing the periplasm              F2                    10
Inner Membrane Lipoprotein                                          E                     21
Periplasmic                                                         G                     295
Peripheral outer membrane protein facing the periplasm              F3                    8
Outer Membrane Lipoprotein                                          I                     94
Outer Membrane β-barrel protein                                     H                     64
Peripheral outer membrane protein facing the extra-cellular space   F4                    12
Extra-cellular                                                      X                     1
Total                                                                                     505

Cytoplasmic
Cytoplasmic                                                         A                     1851
Peripheral proteins                                                 F1                    514
Total                                                                                     2365

Other Gram-negative and Gram-positive bacteria

To test whether the features that MatureP selects are universal, we measured its effectiveness in predicting secretory proteins
from 25 Gram-negative and 10 Gram-positive bacteria from various phyla (7120 and 1361 secretory proteins, respectively; see table below).
These were identified as Sec secretory proteins by combining SignalP 4.0,
LipoP and PRED-TAT.

According to Nielsen et al., the training and test sets should be non-redundant,
and similar (homologous) sequences should be discarded to avoid overestimating the predictive performance of the classifiers.
We performed redundancy reduction on the original dataset (above) following the procedure used by
SignalP, using the algorithm of
Hobohm with iterative position-specific alignments.
The BLAST+ suite of NCBI was used: the makeblastdb command converts the input FASTA files into BLAST database files, and the psiblast
command implements the position-specific iterative basic local alignment search of Altschul et al.
This resulted in a non-redundant dataset of 1070 cytoplasmic proteins, 207 preproteins and 247 mature domain sequences.
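
The idea of redundancy reduction can be illustrated with a greedy selection in the style of Hobohm's first algorithm: walk through the sequences in order and keep one only if it is not similar to any sequence already kept. Which Hobohm variant and which similarity cutoff were actually used is not stated above, so this is a sketch under assumptions. The identifiers and similar pairs are invented and stand in for pairs reported as homologous by an alignment search such as psiblast.

```python
def greedy_nonredundant(sequence_ids, similar_pairs):
    """Keep a sequence only if it is not similar to any sequence kept so far.

    sequence_ids  : identifiers, in the order they should be considered
    similar_pairs : iterable of (id_a, id_b) pairs judged too similar
    """
    similar = {frozenset(pair) for pair in similar_pairs}
    kept = []
    for seq_id in sequence_ids:
        if all(frozenset((seq_id, other)) not in similar for other in kept):
            kept.append(seq_id)
    return kept


# Invented identifiers and similarity pairs, for illustration only.
ids = ["p1", "p2", "p3", "p4"]
reduced = greedy_nonredundant(ids, [("p1", "p2"), ("p3", "p4")])
chained = greedy_nonredundant(ids, [("p1", "p2"), ("p2", "p3")])
```

In the second call p3 is kept even though it is similar to p2, because p2 was already discarded; the result of a greedy pass therefore depends on the order in which sequences are considered.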