The best sources of sequence files based on PDB structures are the
Astral database and Roland Dunbrack's lab. Dunbrack calls his PDB
sequence resource PISCES: http://dunbrack.fccc.edu/PISCES.php. It is
updated weekly, and all PDB protein sequences are available at various
%identity and resolution cutoffs.
His S2C files also provide a mapping between SEQRES and ATOM residues,
which might solve your other problem. In addition, each file lists the
secondary structure both as defined in the PDB file and as calculated by
a single program (STRIDE), so this might also save you a lot of time.
You can download his S2C files for the entire PDB.
His sequence files correctly translate modified residues to their
nearest standard equivalent, as opposed to other sites like NCBI, which
translate them as X's (e.g., selenomethionine is M instead of X,
phosphorylated tyrosine is Y instead of X, etc.).
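
As a rough illustration of the difference, here is a minimal Python
sketch of that kind of translation. The table of modified residues is a
small hand-picked sample (MSE, PTR, SEP, TPO), not the list PISCES
actually uses, and to_one_letter is a made-up helper name.

```python
# Standard 20 amino acids (three-letter -> one-letter).
STANDARD = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}

# A few modified residues mapped to their parent amino acid.
# Illustrative sample only; this is NOT the table PISCES uses.
MODIFIED = {
    "MSE": "M",  # selenomethionine
    "PTR": "Y",  # phosphotyrosine
    "SEP": "S",  # phosphoserine
    "TPO": "T",  # phosphothreonine
}


def to_one_letter(resnames, translate_modified=True):
    """Translate three-letter residue names; unknown names become 'X'."""
    table = dict(STANDARD)
    if translate_modified:
        table.update(MODIFIED)
    return "".join(table.get(name.upper(), "X") for name in resnames)


residues = ["SER", "LEU", "ARG", "SER", "MSE", "VAL"]
print(to_one_letter(residues))                            # SLRSMV
print(to_one_letter(residues, translate_modified=False))  # SLRSXV
```

With translate_modified=False you get the "X" behaviour described
above; with the default, MSE comes out as M.
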
There are two sequences that are relevant to any given chain in a PDB
file. One is the "SEQRES" sequence, and one is what I call the "ATOM"
sequence, or the residues that are visible in the electron density and
reported in the ATOM records. The SEQRES sequence (from the SEQRES
records in the PDB file) should, in theory, reflect all the residues in
the protein that crystallized, including those that might be disordered.
The "ATOM" sequence is derived from the ATOM/HETATM coordinate records
and reflects only what could be fit from the experimental electron
density.
For example, your structure might be a homotetramer. There will be four
sets of SEQRES records, chains A,B,C,D, all of which are identical. On
the other hand, different residues in each of the four monomers might be
disordered in the electron density, so the sequences corresponding to
the "ATOM sequence" of the four chains may all be different.
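
To make the distinction concrete, here is a minimal Python sketch that
pulls both sequences out of PDB-format lines using the fixed columns of
the SEQRES and ATOM/HETATM records. The function names and the toy
two-chain records are invented for illustration. A real parser would
also have to deal with ligands, alternate locations and multiple models;
this one only skips waters.

```python
def seqres_sequences(pdb_lines):
    """Return {chain: [resName, ...]} from SEQRES records."""
    seqs = {}
    for line in pdb_lines:
        if line.startswith("SEQRES"):
            # Column 12 is the chain ID; residue names start at column 20.
            seqs.setdefault(line[11], []).extend(line[19:70].split())
    return seqs


def atom_sequences(pdb_lines):
    """Return {chain: [resName, ...]} for residues that have coordinates."""
    seqs, seen = {}, set()
    for line in pdb_lines:
        if not line.startswith(("ATOM  ", "HETATM")):
            continue
        # Columns 18-20 resName, 22 chain ID, 23-27 resSeq + insertion code.
        resname, chain, resid = line[17:20], line[21], line[22:27]
        if resname == "HOH" or (chain, resid) in seen:
            continue
        seen.add((chain, resid))
        seqs.setdefault(chain, []).append(resname)
    return seqs


# Toy homodimer (made-up data): identical SEQRES records for both chains,
# but chain A lacks density for its first residue and chain B for its last.
residues = ["SER", "LEU", "ARG", "SER", "MSE", "VAL"]
lines = [
    f"SEQRES {1:>3} {chain} {len(residues):>4}  " + " ".join(residues)
    for chain in "AB"
]
observed = {"A": range(1, 6), "B": range(0, 5)}
serial = 0
for chain, indices in observed.items():
    for i in indices:
        serial += 1
        record = "HETATM" if residues[i] == "MSE" else "ATOM"
        # Minimal CA-only record laid out in the PDB fixed columns.
        lines.append(
            f"{record:<6}{serial:>5}  CA  {residues[i]} {chain}{61 + i:>4}"
        )

seqres = seqres_sequences(lines)
atoms = atom_sequences(lines)
for chain in "AB":
    print(chain, "SEQRES:", " ".join(seqres[chain]))
    print(chain, "ATOM:  ", " ".join(atoms[chain]))
```

The two SEQRES lists come out identical, while the two ATOM lists
differ, which is exactly the situation described above.
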
Dunbrack's website provides a mapping of the residue numbering between
the SEQRES sequence and the ATOM sequence:
http://dunbrack.fccc.edu/Guoli/s2c/index.php
For example, the SEQRES sequence will always start at 1 and go to n,
where n is the length of the SEQRES sequence. This would be the
numbering if you used the sequence for a BLAST search. On the other
hand, the ATOM residue numbering might be based (as it should be) on the
full-length, biologically relevant protein. For example, Q9KL26 from V.
cholerae is predicted to be a membrane-bound protein, and the structure
of only the N-terminal half of the protein is in the PDB (3c8c). Any
disordered residues would still be listed in the SEQRES column, even
though they have no coordinates in the ATOM records.
The S2C file provides the residue number mapping between the SEQRES and
ATOM sequence, as well as the secondary structure and %solvent
accessibility:
SEQCRD A S SER SER 1 61 H T 97
SEQCRD A L LEU LEU 2 62 H T 25
SEQCRD A R ARG ARG 3 63 H T 82
SEQCRD A S SER SER 4 64 H T 47
SEQCRD A M MSE MSE 5 65 - - -
SEQCRD A V VAL VAL 6 66 H H 6
...
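
Lines like these split cleanly on whitespace. Here is a minimal Python
sketch of reading them, assuming the ten columns are: record name,
chain, one-letter code, SEQRES residue name, ATOM residue name, SEQRES
number, ATOM number, secondary structure from the PDB file, secondary
structure from STRIDE, and accessibility. That column order is my
reading of the excerpt, not something taken from the S2C documentation,
and S2CRow and parse_seqcrd are made-up names. A file with a blank chain
ID would need column-based parsing instead.

```python
from typing import NamedTuple, Optional


class S2CRow(NamedTuple):
    chain: str
    one_letter: str
    seqres_name: str
    atom_name: str
    seqres_num: int
    atom_num: Optional[str]       # kept as text; may carry an insertion code
    ss_pdb: Optional[str]         # assumed: secondary structure from the PDB file
    ss_stride: Optional[str]      # assumed: secondary structure from STRIDE
    accessibility: Optional[str]  # assumed: %solvent accessibility


def parse_seqcrd(line):
    """Parse one whitespace-separated SEQCRD line; '-' becomes None."""
    fields = line.split()
    if len(fields) != 10 or fields[0] != "SEQCRD":
        raise ValueError(f"not a SEQCRD line: {line!r}")
    chain, one, seqres_name, atom_name, seqres_num = fields[1:6]
    rest = [None if value == "-" else value for value in fields[6:]]
    return S2CRow(chain, one, seqres_name, atom_name, int(seqres_num), *rest)


excerpt = """\
SEQCRD A S SER SER 1 61 H T 97
SEQCRD A L LEU LEU 2 62 H T 25
SEQCRD A R ARG ARG 3 63 H T 82
SEQCRD A S SER SER 4 64 H T 47
SEQCRD A M MSE MSE 5 65 - - -
SEQCRD A V VAL VAL 6 66 H H 6
"""

rows = [parse_seqcrd(line) for line in excerpt.splitlines()]
sequence = "".join(row.one_letter for row in rows)
seqres_to_atom = {row.seqres_num: row.atom_num for row in rows}

print(sequence)           # SLRSMV
print(seqres_to_atom[1])  # 61
print(rows[4].ss_stride)  # None
```

The seqres_to_atom dictionary is the residue number mapping: position 1
of the sequence you would use for BLAST corresponds to residue 61 in the
coordinate records.
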
Notice how the selenomethionine (MSE) is translated as M, but its
secondary structure is not reported, since a lot of these programs
(e.g., STRIDE) don't know how to handle HETATM records as part of the
chain. That's probably why STRIDE annotates the first 4 residues as
"turn" as opposed to "helix".
Mike
-----Original Message-----
From: pdb-l-bounces from sdsc.edu [mailto:pdb-l-bounces from sdsc.edu] On Behalf
Of Narges Habibi
Sent: Wednesday, April 23, 2008 4:26 AM
To: pdb-l from sdsc.edu; proteins from magpie.bio.indiana.edu
Subject: pdb-l: About PDB Files and Secondary Structures
Hi all,
I'm doing a project on "Protein Contact Map Prediction" and I use some
features as the neural network's input, including the secondary
structure of a given amino acid. There are several ways to get it:
1- getting the dssp file for each pdb file (from the ftp server)
2- extracting it from the pdb file (the HELIX, SHEET and TURN sections)
3- getting the ss file from www.pdb.org (as far as I can see, the
sequences given in this file don't match the pdb files; why?)
What do you suggest? Which method is more accurate?
Thanks in advance
--
Narges Habibi