The ProDom data file description

ProDom is a protein domain family database
constructed automatically by clustering homologous
segments. The ProDom building procedure MKDOM2 is based on
recursive PSI-BLAST
searches [ALTS2]. The
source protein sequences are non-fragmentary sequences
derived from UniProtKB (SWISS-PROT and TrEMBL databases). ProDom was
first established in 1993 [SONN] and was originally maintained by the
Laboratoire de Génétique Cellulaire and the Laboratoire des
Interactions Plantes-Microorganismes
(INRA/CNRS) in Toulouse. It is now maintained by the PRABI (the bioinformatics center of Rhône-Alpes). The ProDom database consists of
domain family entries. Each entry provides a multiple
sequence alignment of homologous domains and a family
consensus sequence.

A ProDom entry is characterised by a unique
accession number. The purpose of accession numbers is
to provide a stable way of identifying entries through
releases. As ProDom is rebuilt from scratch for every new version,
we have developed a tool that transfers accession numbers
stably from one version to the next by searching
for domain family overlaps between the two versions. If an
entry is split into two or more entries, the accession
number of the parent entry is assigned to one of the child
entries, and new accession numbers are created for the other
child entries. If two or more entries are merged, the
accession number of one of the parent entries is assigned to
the child entry; the other parent accession numbers are
stored as "obsolete" accession numbers, and they refer to
the child entry. If an entry is deleted from the database,
its accession number is stored as a "deleted" accession
number in the ProDom database.
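
The rules above suggest how a client could resolve an accession number taken from an older release. The following is only a minimal sketch: the containers `current`, `obsolete` and `deleted`, the function name, and the way the data would be loaded are all hypothetical, and ProDom's own transfer tool is not reproduced here.

```python
def resolve_accession(ac, current, obsolete, deleted):
    """Return (status, accession) for an accession number from an older release.

    current  -- set of accession numbers present in the new release
    obsolete -- dict mapping an obsolete accession to the entry it was merged into
    deleted  -- set of accession numbers removed from the database
    All three containers are hypothetical; they only mirror the rules in the text.
    """
    seen = set()
    while ac in obsolete and ac not in seen:
        seen.add(ac)       # guard against accidental cycles in the mapping
        ac = obsolete[ac]  # an obsolete number refers to the child entry
    if ac in current:
        return ("current", ac)
    if ac in deleted:
        return ("deleted", ac)
    return ("unknown", ac)
```

An accession that was merged away is thus followed to the surviving entry, while a deleted one is reported as such rather than silently dropped.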

Each domain is a segment derived from a
protein sequence. Such a sub-sequence is
identified by the name of the protein in the SWISS-PROT or TrEMBL
database, followed by the start and end
points of the domain in the whole amino acid sequence (domain
boundaries). SWISS-PROT sequence identifiers have two
parts. The first is the name of the entry (maximum four
letters), and the second is a code for the organism
from which the sequence originates (maximum five
letters). TrEMBL identifiers are the accession
numbers of the proteins in the TrEMBL database. ProDom appends
the same five-letter organism code as used in SWISS-PROT to
the TrEMBL identifier when the OS line of the TrEMBL entry
makes it possible to identify the source organism.
Otherwise, the first word of the OC line is used to
make a "rough" organism code reflecting the domain of life,
which is added to the TrEMBL identifier as described in the
following table:

Code used in ProDom    Domain of life
EEEEE                  Eukaryota
BBBBB                  Bacteria
AAAAA                  Archaea
VVVVV                  Viruses
XXXXX                  Unknown
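
The fallback rule in the table above can be sketched as follows. This is an illustration only: the function names are hypothetical, the tokenisation of the OC line is an assumption, and joining the accession number and the code with an underscore is inferred from identifiers such as Q8ZP15_SALTY rather than stated explicitly.

```python
# Mapping taken from the table above; keys are the first word of the OC line.
ROUGH_CODES = {
    "Eukaryota": "EEEEE",
    "Bacteria": "BBBBB",
    "Archaea": "AAAAA",
    "Viruses": "VVVVV",
}


def rough_organism_code(oc_line):
    """Return the rough five-letter code for the first word of an OC line."""
    words = oc_line.replace(";", " ").replace(".", " ").split()
    if words and words[0] == "OC":  # drop the line code if it is present
        words = words[1:]
    first = words[0] if words else ""
    return ROUGH_CODES.get(first, "XXXXX")  # XXXXX stands for "Unknown"


def prodom_trembl_id(accession, oc_line, species_code=None):
    """Append the species code if known, otherwise the rough code."""
    code = species_code or rough_organism_code(oc_line)
    return f"{accession}_{code}"  # underscore separator is an assumption
```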

Structure of a domain family entry

The entries in the ProDom database are structured so as to be usable by human readers
as well as by computer programs. The comments and keywords are in ordinary English.
Each family entry is composed of lines. Different types of lines, each with their own format,
are used to record the data that make up the entry.
A sample domain family entry is shown below:

Each line begins with a two-character line code, which indicates the type of data
contained in the line. The current line types, their line codes, and the order in which they appear in an entry are
shown in the table below.

Line code   Content                     Occurrence in an entry
ID          IDentification              Once; starts the entry
AC          ACcession number            Once
KW          KeyWords                    Once
LA          Length of Alignment         Once
ND          Number of Domains           Once
NM          NorMD value                 Once
CC          Parsable comments           Three times
DC          Database Comments           Optional
AL          Domain Alignment Line       One or more
CO          Consensus sequence          Once
DR          Database cross-References   Optional
CT          Copyright notice            Required (2 lines)
//          Termination line            Once; ends the entry

As shown in the table above, some line types are found in all entries, while others are optional. Some line types occur many times in a single entry. Each entry must begin with an IDentification line (ID) and end with a Termination line (//).
A detailed description of each line type is given in the next section of this document.

The ID line

The ID (IDentification) line is always the first line of an entry.
The general form of the ID line is:

ID ENTRY_NUMBER pRELEASE NUMBER_OF_DOMAINS seq.

Entry number

The entry number identifies a family within a single
ProDom version; it is not stable across successive ProDom
releases. It is equal to the rank of the
family after sorting ProDom by decreasing number of
domains per family.

Release

The release number indicates the ProDom release of the current entry.
As the database is built de-novo at each release, we strongly advise users to completely reload
ProDom at each new release.

Number of domains

The number of domains is the number of homologous sub-sequences in the multiple
alignment of the family. A protein may contain several homologous domains of the same family; each occurrence
of the domain is counted.

Example

ID 20167 p2002.1 10 seq.
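
The general form of the ID line lends itself to a regular expression. The sketch below is one possible reading, assuming that fields are separated by arbitrary whitespace and that the release is whatever follows the letter p; the function and field names are hypothetical.

```python
import re

# ID ENTRY_NUMBER pRELEASE NUMBER_OF_DOMAINS seq.
ID_LINE = re.compile(r"^ID\s+(\d+)\s+p(\S+)\s+(\d+)\s+seq\.$")


def parse_id_line(line):
    """Split an ID line into entry number, release and number of domains."""
    match = ID_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not a ProDom ID line: {line!r}")
    return {
        "entry_number": int(match.group(1)),
        "release": match.group(2),  # kept as text, e.g. "2002.1"
        "number_of_domains": int(match.group(3)),
    }
```

Applied to the example above, this yields entry number 20167, release 2002.1 and 10 domains.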

The AC line

The AC (ACcession number) line gives the accession number, a stable and unique key
associated with each ProDom entry and used to access the database.
The format of the accession number is: the 2 letters PD followed
by exactly 6 digits.
For ProDom-CG, the format of the accession number is: the 2
letters CG followed by digits in the same format.

Example

AC PD266930

AC CG266930
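
A format check for accession numbers could look like the sketch below. It assumes that ProDom-CG accession numbers also carry exactly six digits, which is how the examples above read; the function name and the returned field names are hypothetical.

```python
import re

# Two letters (PD or CG) followed by exactly six digits.
AC_PATTERN = re.compile(r"^(PD|CG)(\d{6})$")


def parse_ac_line(line):
    """Validate an AC line and report which database the accession belongs to."""
    fields = line.split()
    if len(fields) != 2 or fields[0] != "AC":
        raise ValueError(f"not a ProDom AC line: {line!r}")
    match = AC_PATTERN.match(fields[1])
    if match is None:
        raise ValueError(f"malformed accession number: {fields[1]!r}")
    database = "ProDom" if match.group(1) == "PD" else "ProDom-CG"
    return {"database": database, "accession": fields[1], "digits": match.group(2)}
```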

The KW line

The KW line contains keywords that help to characterise the domain
family described by the ProDom entry.
The general form of the KW line is:

KW [FREQUENT_NAME(OCCURRENCE)...] // KEYWORD [KEYWORD ...]

Frequent name and occurrence

The frequent name is one of the three most frequent sequence names in the family.
A sequence name is the sequence identifier of a UniProt
entry without the five-letter organism code. The occurrence is the number of times this name appears in the family.

Keyword

A keyword is one of the 10 most frequent words found in the KW and DE lines of the
UniProt entries of all the domain family members.
Up to 10 keywords can be listed on the KW line, and the
keywords are sorted by decreasing frequency. The procedure that
builds this automatic comment may be improved in future
ProDom releases.
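
The general form of the KW line can be split mechanically at the // separator. The sketch below assumes that frequent names and keywords are separated by whitespace, which the general form suggests but does not state; the function name is hypothetical, and the sample line in the usage note is invented for illustration, not taken from a ProDom release.

```python
import re

# FREQUENT_NAME(OCCURRENCE): a name immediately followed by a count in parentheses.
NAME_WITH_COUNT = re.compile(r"(\S+?)\((\d+)\)")


def parse_kw_line(line):
    """Return ([(frequent_name, occurrence), ...], [keyword, ...])."""
    if not line.startswith("KW"):
        raise ValueError(f"not a ProDom KW line: {line!r}")
    names_part, separator, keywords_part = line[2:].partition("//")
    if not separator:
        raise ValueError("KW line lacks the // separator")
    frequent_names = [
        (name, int(count)) for name, count in NAME_WITH_COUNT.findall(names_part)
    ]
    keywords = keywords_part.split()  # assumed to be whitespace-separated
    return frequent_names, keywords
```

For an invented line such as `KW GNTR(4) YDFH(2) // TRANSCRIPTION REGULATION DNA-BINDING`, the function returns the two frequent names with their counts and the three keywords in order.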

The LA line

The LA (Length of Alignment) line gives
the length of a domain sequence, gaps included, once aligned
with the other homologous domains of the family.

Example

LA 74

The ND line

The ND (Number of Domains) line gives the number of homologous sub-sequences in the family.

Example

ND 10

The NM line

The NM line gives the NorMD value [THOM] computed for this family. The quality of the
alignment is generally considered "good" if the NorMD value is higher than 0.4.

Example

NM 0.506

The CC lines

The CC lines are parsable comments about the
ProDom family entry. They record family
consistency indicators and the name of the domain closest to the
consensus.
The general form of a CC line is:

CC -!- TOPIC: INFORMATION

There are three topic types: DIAMETER, RADIUS OF GYRATION, and SEQUENCE CLOSEST TO CONSENSUS.

The diameter

The diameter is the maximal distance between
two domains in the family. The distances are computed in PAM. In
some cases, the distance between two domains cannot be
computed, so "1001 PAM" is given as the default value.

The radius of gyration

The radius of gyration is the weighted root
mean square distance between each domain and the family
consensus sequence. The distances are also computed in PAM.
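
As an illustration of the definition above, a weighted root mean square distance can be computed as follows. This is a sketch under assumptions: the text does not say how the weights are normalised, so dividing by the sum of the weights is a guess, and the function name and argument layout are hypothetical.

```python
import math


def radius_of_gyration(distances, weights):
    """Weighted root mean square of the domain-to-consensus distances (in PAM).

    distances -- distance between each domain and the consensus sequence
    weights   -- one weight per domain, in the same order
    Normalising by the sum of the weights is an assumption.
    """
    if len(distances) != len(weights) or not distances:
        raise ValueError("need one weight per distance and at least one domain")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_squares = sum(w * d * d for d, w in zip(distances, weights))
    return math.sqrt(weighted_squares / total_weight)
```

With equal weights this reduces to the ordinary root mean square of the distances.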

The sequence closest to consensus

The sequence closest to consensus is the
sub-sequence whose distance to the family
consensus sequence is the smallest. This information can help to
select the domain that best represents the family.

The DC lines

The DC (Database Comments) lines indicate the query used by
PSI-BLAST to build this family: a UniProt
sequence or a SCOP domain.

Example

DC This family was generated by psi-blast, with a profile built from the seed aligment of the following SCOP FAMILY
DC a.4.5.6

The AL lines

Each AL (Alignment Line) represents a domain
aligned with all the homologous domains of the family.
The general form of the AL line is:

AL SWISS-PROT_AC|SWISS-PROT_ID BEGIN END WEIGHT ALIGNED_SEQUENCE

or:

AL TREMBL_AC|TREMBL_ID_SPECIES BEGIN END WEIGHT ALIGNED_SEQUENCE

The SWISS-PROT and the TrEMBL accession numbers

The SWISS-PROT or TrEMBL accession
number is the accession number of the protein sequence in
the SWISS-PROT or TrEMBL database, respectively.

The SWISS-PROT and the TREMBL identifiers

The SWISS-PROT identifier is the sequence
identifier of the protein in the SWISS-PROT database. The
TrEMBL identifier is the accession number of the sequence
in the TrEMBL database, modified as described in "database
conventions".

The domain begin and end

The begin and end numbers give the
boundaries of the domain in the whole protein sequence. The
amino acid numbering is the same as in SWISS-PROT and
TrEMBL.

The weight and the aligned sequence

In ProDom families, the multiple sequence
alignment and the weights are computed by Multalin [CORP1].
Sequence weights make it possible to down-weight overly similar sequences in
the alignment. The smaller the weight, the more closely similar
domains the current domain has in the family.
The aligned sequences are given with two types of gaps:
. for gaps at domain extremities (external gaps), and
- for gaps inside the domain sequence (internal gaps).
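
Following the general form of the AL line and the two gap symbols described above, an AL line could be parsed as sketched below. The class and function names are hypothetical, whitespace-separated fields are an assumption, and the sample line in the usage note is invented: only the identifier Q8ZP15_SALTY appears elsewhere in this document.

```python
from dataclasses import dataclass


@dataclass
class AlignedDomain:
    accession: str
    identifier: str
    begin: int
    end: int
    weight: float
    aligned: str


def parse_al_line(line):
    """Parse 'AL AC|ID BEGIN END WEIGHT ALIGNED_SEQUENCE'."""
    fields = line.split()
    if len(fields) != 6 or fields[0] != "AL":
        raise ValueError(f"not a ProDom AL line: {line!r}")
    accession, _, identifier = fields[1].partition("|")
    return AlignedDomain(
        accession, identifier,
        int(fields[2]), int(fields[3]), float(fields[4]), fields[5],
    )


def gap_counts(aligned):
    """Return (external_gaps, internal_gaps): '.' is external, '-' is internal."""
    return aligned.count("."), aligned.count("-")
```

For the invented line `AL Q8ZP15|Q8ZP15_SALTY 5 12 0.50 ..MKL-AVRTQ.`, the aligned sequence holds three external gaps, one internal gap and eight residues, which matches the span from position 5 to position 12.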

The CO line

The CO (COnsensus) line contains the
consensus sequence of the domain family. It is computed by
Multalin from the family multiple alignment. For each column of
the multiple alignment, external gaps are not taken into account
when calculating the consensus amino acid. Thus, there is no
external gap in the consensus sequence; only internal
gaps are allowed.
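
To illustrate how ignoring external gaps keeps them out of the consensus, the sketch below uses a plain majority vote per column. This is a stand-in chosen for simplicity, not Multalin's consensus method, and the function name is hypothetical.

```python
from collections import Counter


def majority_consensus(aligned_sequences):
    """Column-wise majority vote that ignores external gaps ('.').

    Internal gaps ('-') take part in the vote, so they may appear in the result.
    NOTE: a simple majority vote, not the method used by Multalin.
    """
    if len({len(seq) for seq in aligned_sequences}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    consensus = []
    for column in zip(*aligned_sequences):
        symbols = [c for c in column if c != "."]  # drop external gaps
        if not symbols:
            raise ValueError("column contains only external gaps")
        consensus.append(Counter(symbols).most_common(1)[0][0])
    return "".join(consensus)
```

Because the '.' symbol never takes part in the vote, it cannot be selected for any column, whereas '-' can.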

The DR lines

The DR (Database cross-Reference) lines are
used as pointers to information related to the ProDom entry and
found in data collections other than ProDom.
The general form of a DR line is:

DR DATABASE_IDENTIFIER; INFORMATION

The database identifier

ProDom families are currently cross-referenced to the following databases:

Identifier   Database description
GO           GENE ONTOLOGY database
INTERPRO     INTERPRO protein families database
PROSITE      PROSITE protein domains and families database
PFAMA        Pfam-A protein domain database
PDB          Brookhaven Protein Data Bank

The cross-reference information

The cross-reference information
consists of an unambiguous pointer to the information
entry in the target database, and some extra information
such as the name of the relevant domain, or the position in
the sequence.

GO: the cross-reference information is the accession number of the
GO entry, the corresponding ontology (biological Process, molecular Function,
Cellular component), a precision indicator (from 0 to 1; a higher value means a more precise term
in the Gene Ontology tree, i.e. one nearer to a leaf), an assignment probability, and the entry name.

DR GO; GO:0006810 P 0.275 1.00 "transport"
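
The GO cross-reference above can be decomposed with a regular expression. The sketch assumes that the ontology is encoded by the capitalised letter (P, F or C) and that the two numbers appear in the order given in the description, precision first and probability second; the function and field names are hypothetical.

```python
import re

# DR GO; ACCESSION ONTOLOGY PRECISION PROBABILITY "NAME"
GO_DR = re.compile(r'^DR\s+GO;\s+(GO:\d+)\s+([PFC])\s+(\S+)\s+(\S+)\s+"(.*)"$')

ONTOLOGIES = {
    "P": "biological process",
    "F": "molecular function",
    "C": "cellular component",
}


def parse_go_dr(line):
    """Split a GO cross-reference line into its five pieces of information."""
    match = GO_DR.match(line.strip())
    if match is None:
        raise ValueError(f"not a GO cross-reference: {line!r}")
    accession, ontology, precision, probability, name = match.groups()
    return {
        "accession": accession,
        "ontology": ONTOLOGIES[ontology],
        "precision": float(precision),      # assumed to be the first number
        "probability": float(probability),  # assumed to be the second number
        "name": name,
    }
```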

INTERPRO: the cross-reference information
is the accession number of the INTERPRO entry and the
name of the entry.

DR INTERPRO; IPR000524 "Bacterial regulatory proteins, GntR"

PROSITE: the cross-reference information
is the accession number of the PROSITE entry, the
accession number of the associated documentation
(PDOC), the identifier of the PROSITE entry and the
position of the pattern match on the consensus
sequence.

DR PROSITE; PS00043 PDOC00042 HTH_GNTR_FAMILY (27-51)

PFAM-A: the cross-reference information is the
accession number and the identifier of the Pfam-A
entry.

DR PfamA; PF00392 gntR

PDB: the cross-reference information is the PDB
code of the three-dimensional structure, the chain
identifier, the position of the match in the structure, the name of the relevant
sequence and the position of the match in the
sequence. As several chains or several PDB entries
generally match the same SWISS-PROT entry, those other
PDB entries are listed with a comma
(,) as separator.

DR PDB; 1H9T chain B (5-78) Q8ZP15_SALTY (5-78),1HW1 chain A (5-78),1HW1 chain B (5-78)
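
The comma-separated PDB matches in the example above can be extracted as sketched below. The pattern is an assumption derived from that single example (a four-character PDB code, the word "chain", a chain identifier and a position range in parentheses); the sequence name and its own position are deliberately skipped. The function name is hypothetical.

```python
import re

DR_PDB_PREFIX = re.compile(r"^DR\s+PDB;\s*")
# e.g. "1H9T chain B (5-78)"
PDB_MATCH = re.compile(r"\b(\w{4}) chain (\w+) \((\d+)-(\d+)\)")


def parse_pdb_dr(line):
    """Return a list of (pdb_code, chain, begin, end) tuples from a DR PDB line."""
    prefix = DR_PDB_PREFIX.match(line)
    if prefix is None:
        raise ValueError(f"not a PDB cross-reference: {line!r}")
    info = line[prefix.end():]
    return [
        (code, chain, int(begin), int(end))
        for code, chain, begin, end in PDB_MATCH.findall(info)
    ]
```

On the example line this yields three matches: chain B of 1H9T and chains A and B of 1HW1, each covering positions 5 to 78 of the structure.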