SSMap: a new UniProt-PDB mapping resource for the curation of structural-related information in the UniProt/Swiss-Prot Knowledgebase.

David FP, Yip YL - BMC Bioinformatics (2008)

Bottom Line:
SSMap was compared to other existing mapping resources in terms of the correctness of the attribution of PDB chains to UniProtKB entries, and of the quality of the pairwise alignments supporting the residue-residue mapping. It was found that SSMap shared about 80% of the mappings with other mapping sources. SSMap mapping is currently used to provide PDB cross-references in UniProtKB.

Background: Sequences and structures provide valuable complementary information on protein features and functions. However, it is not always straightforward for users to gather information concurrently from the sequence and structure levels. The UniProt knowledgebase (UniProtKB) strives to help users in this undertaking by providing complete cross-references to the Protein Data Bank (PDB) as well as coherent feature annotation using available structural information. In this study, SSMap, a new UniProt-PDB residue-residue level mapping, was generated. The primary objective of this mapping is not only to facilitate the two tasks mentioned above, but also to address a number of shortcomings of existing mappings. SSMap is the first isoform sequence-specific mapping resource and is kept up to date for UniProtKB annotation tasks. The method employed by SSMap differs from the other mapping resources in that it emphasizes the correct reconstruction of the PDB sequence from structures, and the correct attribution of a UniProtKB entry to each PDB chain through a series of post-processing steps.

Results: SSMap was compared to other existing mapping resources in terms of the correctness of the attribution of PDB chains to UniProtKB entries, and of the quality of the pairwise alignments supporting the residue-residue mapping. It was found that SSMap shared about 80% of the mappings with other mapping sources. New and alternative mappings proposed by SSMap were mostly judged to be good by manual verification of data subsets. For local pairwise alignments, major discrepancies (in both alignment lengths and boundaries), when present, were often due to differences in the methodologies used for the mappings.

Conclusion: SSMap provides an independent, good-quality UniProt-PDB mapping. The systematic comparison conducted in this study allows further identification of general problems in UniProt-PDB mappings, so that both the coverage and the quality of the mappings can be systematically improved for the benefit of the scientific community. SSMap mapping is currently used to provide PDB cross-references in UniProtKB.

Mentions:
From the PDB perspective, 89% of all the protein PDB chains (90,923) were mapped unambiguously to at least one UniProtKB entry with at least 90% sequence identity (Figure 3). Among these 90,923 mapped PDB chains, 126 were mapped unambiguously to several UniProtKB entries. In all these cases, different fragments of the PDB chain were mapped to different UniProtKB entries. These corresponded mostly to immune system or viral proteins with high conservation (sequence identity > 90%), or to fusion proteins (e.g. PDB:1R6Z chain Z or PDB:2JAD chain A). About 2% (1,741) of the PDB chains were only supported by alignments with a sequence identity lower than 90%. There was ambiguity for 6% (6,792) of the PDB chains, where attribution to several UniProtKB entries was possible (Figure 3). A small number (61) of PDB chains were small synthetic peptides not mapped at the taxonomy level to UniProtKB entries. For these last three PDB chain datasets, the associated mappings could not be validated automatically and were thus not included in the final mapping results. The remaining 3% (2,918) of protein PDB chains were not found at all among the available SSMap alignments (sequence identity greater than 70%). Among these, nearly all (2,628) chains were shorter than 20 residues; the rest (290) often contained modified/unknown residues or presented unresolved segments.
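The categories described above can be illustrated with a minimal sketch. This is not the SSMap implementation: the `Alignment` record, its field names, the `classify_chain` function and the overlap test used to decide ambiguity are all assumptions made for illustration. Only the two thresholds (alignments kept above 70% identity, mappings accepted at 90% identity or more) come from the text, and the synthetic-peptide category, which depends on taxonomy, is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    """One candidate alignment of a PDB chain to a UniProtKB entry (hypothetical record)."""
    accession: str   # UniProtKB accession
    identity: float  # percent sequence identity, 0-100
    start: int       # first aligned residue on the PDB chain (1-based, inclusive)
    end: int         # last aligned residue on the PDB chain (inclusive)


def overlaps(a: Alignment, b: Alignment) -> bool:
    """True when two alignments cover at least one common residue of the chain."""
    return a.start <= b.end and b.start <= a.end


def classify_chain(alignments, min_identity=70.0, high_identity=90.0) -> str:
    """Assign a PDB chain to one of the categories described in the text.

    Assumption: a chain is 'ambiguous' when high-identity alignments to
    different UniProtKB entries cover overlapping parts of the chain, and
    'unambiguous' when they cover distinct fragments (e.g. fusion proteins).
    """
    # Alignments at or below the lower threshold are not considered at all.
    usable = [a for a in alignments if a.identity > min_identity]
    if not usable:
        return "unmapped"

    # Only alignments at or above the upper threshold support a validated mapping.
    strong = [a for a in usable if a.identity >= high_identity]
    if not strong:
        return "low_identity"

    # Competing entries on the same region of the chain make the attribution ambiguous.
    for i, a in enumerate(strong):
        for b in strong[i + 1:]:
            if a.accession != b.accession and overlaps(a, b):
                return "ambiguous"
    return "unambiguous"
```

Under these assumptions, a fusion chain whose two halves align to two different entries is still classified as unambiguous, whereas two different entries aligned to the same region are reported as ambiguous and would be excluded from the final mapping.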
