146266 Biological Macromolecular Structures Enabling Breakthroughs in Research and Education

Redundancy in the Protein Data Bank

Statistics

The following table shows the number of non-redundant sequences as determined by blastclust
at several levels of sequence identity.

Method   Description      # of Clusters
blast    100% identity    69634
blast    95% identity     56270
blast    90% identity     53273
blast    70% identity     46463
blast    50% identity     39393
blast    40% identity     34455
blast    30% identity     28824

Notes on Blast Clustering

Blast clustering is performed with the following parameters (shown here
for the 95% identity level):

-p T -b T -S 95

The '-b T' parameter in BLASTClust means that the length coverage threshold
is enforced over both members of a sequence pair.

The '-p T' parameter means that the input sequences are protein sequences.

The '-S 95' parameter sets the percent identity threshold used to include two sequences in a cluster.

Note: BLASTClust also relies on the -L parameter, which specifies the length coverage threshold for including
a sequence in a cluster. It is set to 0.9 by default. This means that two sequences need to have >= 90% coverage in
the alignment to be clustered together.
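
As an illustration of how these thresholds combine, the following minimal Python sketch tests whether one aligned pair of sequences would qualify for the same cluster. It is not BLASTClust code. The function name and arguments are hypothetical, and it assumes that '-b T' requires the coverage threshold to hold for both sequences, while '-S' applies to the percent identity of the alignment.

```python
def may_cluster(identity_pct, covered_a, len_a, covered_b, len_b,
                min_identity=95.0, min_coverage=0.9, both=True):
    """Return True if an aligned pair meets the identity and coverage thresholds.

    identity_pct : percent identity of the alignment (compared with -S)
    covered_a/b  : number of residues of each sequence covered by the alignment
    len_a/b      : full length of each sequence
    min_coverage : fraction of a sequence that must be covered (compared with -L)
    both         : assumed meaning of -b T, coverage required on both sequences
    """
    if identity_pct < min_identity:
        return False
    coverage_a = covered_a / len_a
    coverage_b = covered_b / len_b
    if both:
        return coverage_a >= min_coverage and coverage_b >= min_coverage
    return coverage_a >= min_coverage or coverage_b >= min_coverage
```

For example, a 100-residue chain aligned over 95 residues to a 200-residue chain at 96% identity would fail the test here, because the alignment covers less than half of the longer chain.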

As the single worldwide repository for macromolecular structures, the
Protein Data Bank holds a body of data that contains considerable
redundancy in regard to both sequence and structure. We have
incorporated into the query interface the ability to select a subset of
structures from which similar sequences have been largely removed. In
most cases, the selected subset will contain far fewer structures than
the complete result set. However, the following caveats should be kept
in mind:

Sequence similarity is defined on a chain basis, but results are
returned on a structure basis.

Many structures in the PDB contain multiple protein chains, or even
hybrids of DNA and protein chains.

Sequence similarity is only assessed for protein chains.

The relationship between sequence similarity and structure similarity
is complex. Users seeking structure similarity should refer to the options
available on the Structure Summary page under "External Links" (in the
left-hand navigation menu) in the "Structure Classification" section.
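
The first caveat above (chains are compared, but structures are returned) can be illustrated with a short Python sketch. This is only one plausible reading, not the PDB query implementation. The function name, the "STRUCTURE.CHAIN" identifier format, and the assumption that each cluster is supplied as a list ranked best-first are all hypothetical.

```python
def filter_redundant(result_chains, clusters):
    """Collapse a chain-level result set to representative structures.

    result_chains : set of chain ids such as "1ABC.A" found by a query
    clusters      : list of clusters, each a list of chain ids ranked best-first
    Returns the sorted structure ids of the best-ranked hit in each cluster.
    """
    kept_structures = set()
    for ranked_chains in clusters:
        for chain_id in ranked_chains:
            if chain_id in result_chains:
                # Keep the structure that owns the first (best-ranked) hit.
                kept_structures.add(chain_id.split(".")[0])
                break
    return sorted(kept_structures)


clusters = [["1AAA.A", "2BBB.A", "3CCC.A"], ["2BBB.B", "4DDD.A"]]
hits = {"2BBB.A", "3CCC.A", "2BBB.B", "4DDD.A"}
print(filter_redundant(hits, clusters))
```

In this toy input, one structure represents both clusters because it contains a chain from each. Chains that belong to no cluster (for instance DNA chains) are simply ignored by the sketch.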

The primary purpose of this feature is to filter a list of likely highly
similar structures to provide one or more representatives. Results may
differ from other so-called non-redundant sets (e.g. PDB_SELECT [Hobohm
U. and Sander C., Protein Science, 3: 522-524, 1994]).

Algorithm for Removing Similar Sequences

The query implementation for removing similar sequences is based on
pre-calculated clusters of protein chains.
All protein chains of at least 20 amino acids are clustered by blastclust
at 100%, 95%, 90%, 70%, 50%, 40%, and 30% sequence identity (defined as the number of
identical residues out of the total number of positions in the sequence alignment).
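
The length cutoff and the identity definition can be sketched in Python as follows. This is a minimal illustration and not the code used to build the clusters. The function names are hypothetical, and it assumes that "total in the sequence alignment" means the full alignment length, including gap columns.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of alignment columns in which both sequences have the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    identical = sum(1 for x, y in zip(aligned_a, aligned_b)
                    if x == y and x != gap)
    # Assumption: the denominator is the alignment length, gaps included.
    return 100.0 * identical / len(aligned_a)


def clusterable_chains(chains, min_length=20):
    """Keep only chains long enough to be clustered (at least 20 residues)."""
    return {chain_id: seq for chain_id, seq in chains.items()
            if len(seq) >= min_length}
```

With this definition, two aligned sequences that match at 4 of 6 alignment columns have about 66.7% identity, so they would share a cluster at the 50% level but not at the 70% level.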

In each cluster, the chains are sorted (i.e. ranked) according to the
following criteria (in this order):