Figure 2.

Schema for the ProteomeGenerator Snakemake workflow. Sequencing reads are aligned using STAR, assembled into transcriptomes using StringTie in either de novo or reference-guided mode, and then processed to identify reading frames and protein isoforms. The resulting protein database is set as the target for peptide–mass spectral matching using MaxQuant.

Figure 3.

Comparison of the canonical and proteogenomic protein databases, displaying (A) the number of protein entries and (B) the number of theoretical tryptic peptides amenable to mass spectrometry analysis that are specific to UniProt, specific to PGX, or shared by both.
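
The comparison of theoretical tryptic peptides in panel B can be illustrated with a minimal in silico digestion sketch. It is not the authors' pipeline: the cleavage rule (after K or R, except before P), the absence of missed cleavages, the 7 to 30 residue length window, and all function names are assumptions chosen for illustration.

```python
def tryptic_peptides(sequence, min_len=7, max_len=30):
    """Return the set of fully tryptic peptides of one protein.

    Assumed rule: cleave after K or R unless the next residue is P,
    with no missed cleavages. The length window is a hypothetical
    stand-in for "amenable to mass spectrometry analysis".
    """
    peptides = set()
    start = 0
    last = len(sequence) - 1
    for i, residue in enumerate(sequence):
        at_end = i == last
        cleave = residue in "KR" and not at_end and sequence[i + 1] != "P"
        if cleave or at_end:
            peptide = sequence[start:i + 1]
            if min_len <= len(peptide) <= max_len:
                peptides.add(peptide)
            start = i + 1
    return peptides


def digest_database(proteins, **kwargs):
    """Union of tryptic peptides over an iterable of protein sequences."""
    peptides = set()
    for seq in proteins:
        peptides |= tryptic_peptides(seq, **kwargs)
    return peptides


def compare_databases(uniprot_proteins, pgx_proteins):
    """Split peptides into UniProt-specific, PGX-specific, and shared sets."""
    uniprot = digest_database(uniprot_proteins)
    pgx = digest_database(pgx_proteins)
    return {
        "uniprot_only": uniprot - pgx,
        "pgx_only": pgx - uniprot,
        "shared": uniprot & pgx,
    }
```

Counting the three returned sets gives the kind of breakdown shown in panel B; the actual numbers depend on the digestion parameters used in the study, which the caption does not state.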

Sensitivity and specificity of mass spectrometry search algorithms. (A, B) Comparison of unique theoretical peptides in the experimental PGX proteome, the canonical UniProt proteome, and the bacterial proteomes used as negative controls. (C) Sensitivity of the tested algorithms, expressed as the number of identified peptides. (D) Specificity of the tested algorithms, evaluated from the fraction of peptide–spectrum matches (PSMs) mapped to the negative controls. The PSM fraction mapped to A. loki is reported both in absolute terms (black) and normalized to account for the relative sizes of the human and archaebacterial proteomes (gray). Normalization was performed by multiplying the number of human peptides by the ratio of the sizes of the A. loki and H. sapiens databases, each expressed as the number of tryptic peptides.
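
The normalization described for panel D can be sketched as simple arithmetic. This is one possible reading of the caption, not a confirmed reconstruction of the authors' calculation: the direction of the ratio (A. loki size divided by H. sapiens size), the way the normalized fraction is formed from the scaled human count, and all names and numbers below are assumptions.

```python
def loki_fractions(n_human, n_loki, db_human_peptides, db_loki_peptides):
    """Absolute and size-normalized fraction of matches mapped to A. loki.

    n_human, n_loki: identified matches assigned to each proteome.
    db_*_peptides: database sizes as counts of theoretical tryptic peptides.
    """
    # Absolute fraction: share of all matches that map to the negative control.
    absolute = n_loki / (n_human + n_loki)

    # Assumed normalization: scale the human count by the database size
    # ratio, so both counts refer to search spaces of comparable size.
    size_ratio = db_loki_peptides / db_human_peptides
    scaled_human = n_human * size_ratio

    # Assumed form of the normalized fraction (not stated in the caption).
    normalized = n_loki / (scaled_human + n_loki)
    return absolute, normalized


# Hypothetical numbers, for illustration only.
absolute, normalized = loki_fractions(
    n_human=9900, n_loki=100,
    db_human_peptides=1_000_000, db_loki_peptides=100_000,
)
print(f"absolute: {absolute:.4f}, normalized: {normalized:.4f}")
```

Under these assumptions, a negative-control database that is much smaller than the human one yields a normalized fraction larger than the absolute one, which is why both values are worth reporting side by side.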

Accurate proteome discovery using statistical target–decoy matching with spectral calibration. (A) Number of peptides identified (FDR < 0.01) by matching spectra from the K052 proteome against the proteogenomic (PGX, red) and canonical (UniProt, gray) databases. (B) Overlap between the peptides identified in the PGX (red) and UniProt (gray) databases. (C) Comparison of PEAKS scores for peptides identified in both the PGX and UniProt databases. (D) PEAKS score distributions for peptides identified exclusively in the PGX (red) and UniProt (gray) databases. (E) For peptides exclusively identified against the PGX database, PEAKS score distributions for peptides not mapping in UniProt (red) or present in the canonical database (gray). Boxes delimit the 25th and 75th percentiles, the middle line corresponds to the median, and whiskers correspond to the 5th and 95th percentiles. (F) PEAKS score distributions for peptides identified exclusively in PGX but also mapping in UniProt (gray) or exclusively mapping in the PGX database (red).
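
The box plot convention stated for panel E (boxes at the 25th and 75th percentiles, median line, whiskers at the 5th and 95th percentiles) can be reproduced for any list of scores with a short sketch. The function name and the choice of the "inclusive" interpolation method are assumptions; the plotting software used for the figure may interpolate percentiles differently.

```python
from statistics import quantiles


def box_summary(scores):
    """Five-number summary matching the caption's box plot convention."""
    # quantiles(..., n=100) returns the 99 cut points between percentiles,
    # so index k - 1 holds the k-th percentile.
    cuts = quantiles(scores, n=100, method="inclusive")
    return {
        "whisker_low": cuts[4],    # 5th percentile
        "box_low": cuts[24],       # 25th percentile
        "median": cuts[49],        # 50th percentile
        "box_high": cuts[74],      # 75th percentile
        "whisker_high": cuts[94],  # 95th percentile
    }
```

Applying `box_summary` separately to each group of scores gives the values needed to draw one box per group.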