Abstract

Differentiating and quantifying protein differences in complex samples poses significant challenges in sensitivity and specificity. Label-free quantification can draw on two information sources: precursor intensities and spectral counts. Intensities are accurate for calculating protein relative abundance, but values are often missing because peptides are identified sporadically. Spectral counting can reliably reproduce difference lists, but differentiating peptides or quantifying all but the most concentrated protein changes is usually beyond its abilities. Here we developed new software, IDPQuantify, to align multiple replicates using principal component analysis, extract accurate precursor intensities from MS data, and combine intensities with spectral counts for significant gains in differentiation and quantification. We applied IDPQuantify to three comparative proteomic data sets featuring gold standard protein differences spiked into complicated backgrounds. The software is able to associate peptides with peaks that are otherwise left unidentified, increasing the efficiency of protein quantification, especially for low-abundance proteins. By combining intensities with spectral counts from IDPicker, it gains an average of 30% more true positive differences among top differential proteins. IDPQuantify quantifies protein relative abundance accurately in these test data sets, producing good correlations between known and measured concentrations.

The IDPQuantify workflow extracts chromatograms for identified peptides to build intensity information for proteins. The mzML data model allows for identification through standard tools, with IDPicker generating sets of confidently identified spectra and assembling proteins. IDPQuantify uses QuasiTel to integrate chromatograms associated with each peptide and sums these peak areas for each protein group. Functions in the R language generate p-values from intensities and spectral counts, integrating them for combined analysis.
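The summation step in this workflow can be illustrated with a minimal Python sketch. It assumes that peak areas have already been integrated for each peptide and that each peptide maps to a single protein group; the function name, peptide sequences, group labels, and area values are all hypothetical, and this is not the IDPQuantify or QuasiTel code.

```python
from collections import defaultdict

def protein_intensities(peptide_areas, peptide_to_group):
    """Sum integrated peptide peak areas into one intensity per protein group."""
    totals = defaultdict(float)
    for peptide, area in peptide_areas.items():
        group = peptide_to_group.get(peptide)
        if group is None:
            continue  # peptide was not assigned to any protein group
        totals[group] += area
    return dict(totals)

# Hypothetical peak areas (arbitrary units) and protein group assignments.
areas = {"PEPTIDEA": 1.5e6, "PEPTIDEB": 2.5e6, "PEPTIDEC": 4.0e5}
groups = {"PEPTIDEA": "G1", "PEPTIDEB": "G1", "PEPTIDEC": "G2"}
intensities = protein_intensities(areas, groups)
```

Peptides shared between protein groups would need an explicit assignment rule, which this sketch leaves out.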

Peptide repeatability in technical replicates of yeast lysates from CPTAC Study 5. (a) Distinct peptide matches shared among replicates. About 30% of peptides were found in only one replicate, and most (67%–79%) peptides were not found in all six replicates. (b) Boxplot of log intensity for the peptides shared by different numbers of replicates. The boxes represent the interquartile range, while the whiskers represent the full range of observed values. The midline in each box is the median. When a peptide was observed in multiple replicates, the figure records the median log intensity across replicates. Peptides that were observed in more replicates were more intense than those appearing in fewer replicates.
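The sharing counts in panel (a) amount to tallying, for each distinct peptide, the number of replicates in which it was identified. A minimal Python sketch follows, using three small hypothetical replicates rather than the CPTAC Study 5 data; the function names are illustrative only.

```python
from collections import Counter

def replicate_sharing(replicates):
    """Map each distinct peptide to the number of replicates identifying it."""
    counts = Counter()
    for peptides in replicates:
        counts.update(set(peptides))  # count each peptide once per replicate
    return counts

def fraction_seen_in(counts, n):
    """Fraction of distinct peptides identified in exactly n replicates."""
    return sum(1 for c in counts.values() if c == n) / len(counts)

# Hypothetical identifications from three replicates.
reps = [["a", "b", "c"], ["a", "b"], ["a", "d"]]
sharing = replicate_sharing(reps)
only_one = fraction_seen_in(sharing, 1)
in_all = fraction_seen_in(sharing, len(reps))
```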

Peptide retention time mapping effectively increased apparent intensity correlation in CPTAC Study 6 replicates. The heatmap plots pairwise correlations between UPS peptide intensities in 12 replicates in each group (A, B) before (left column) or after (right column) RT mapping. The 12 replicates consisted of data from 4 institutes, with 3 replicates from each institute. For example, O65_B1 is the data from institute OrbiO@65, group B, replicate 1. Peptide retention time mapping increased correlation both within and between institutes.

Peptide retention time mapping and p-value combination increased the number of true positives while suppressing false positives in a fixed number of top differential proteins. In the CPTAC Study 6 dataset, adjacent pairs of groups (A vs. B, B vs. C, C vs. D, D vs. E), each differing by a 3-fold change in spike-in UPS protein abundance, were compared. Intensity with RT mapping (red) significantly increased the number of true positives in the top 50 differential proteins, especially in the groups with low spike-in abundance (A vs. B, B vs. C). In the groups with high spike-in abundance (C vs. D, D vs. E), few UPS protein intensity values were missing, so RT mapping did not significantly change the TP rate. P-value combination using Fisher’s method performed best in almost all cases.
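For reference, Fisher's method combines k independent p-values through the statistic X = −2 Σ ln(p_i), which follows a chi-square distribution with 2k degrees of freedom under the null hypothesis. The Python sketch below is a generic implementation of that textbook formula, using the closed-form chi-square tail for even degrees of freedom; it is not the R code used by IDPQuantify, and the example p-values are hypothetical.

```python
import math

def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method.

    X = -2 * sum(ln p_i) is chi-square with 2k degrees of freedom under the
    null. For even degrees of freedom the upper tail has a closed form:
    P(X >= x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    half_stat = -sum(math.log(p) for p in p_values)  # this is X / 2
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i  # (x/2)^i / i!
        total += term
    return math.exp(-half_stat) * total

# Hypothetical p-values for one protein: intensity test and spectral count test.
combined = fisher_combine([0.05, 0.05])
```

With two p-values the result reduces to p1·p2·(1 − ln(p1·p2)), so two marginal results of 0.05 combine to roughly 0.017.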

IDPQuantify accurately estimates protein intensity ratios. (a) Boxplot of log intensity ratios of UPS1 proteins in CPTAC Study 6. Protein intensities were averaged across replicates and compared between two groups. The black horizontal line indicates the true spike-in ratio (standard), log(3). With increased protein abundance, the accuracy of intensity ratio estimation increased, as shown by the vertically narrower boxes. Two outliers below −1 and one above 1.6 were not plotted. (b) Boxplot of log intensity ratios of UPS proteins in the NRD-pfu dataset. The horizontal lines indicate the true spike-in ratio for each cassette in the corresponding color. Each box shows the intensity ratios of UPS proteins in a single replicate (e.g. Replicate 1 A vs. B). In Quintuplets, the intensity ratios were distributed around the true value. The correlation between the average estimated intensity ratio and the spike-in ratio is 96.72% in Sextuplets and 95.36% in Quintuplets.
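The ratio calculation described in panel (a) can be sketched in Python as follows. The sketch assumes natural logarithms and a higher-over-lower direction for the ratio, neither of which is stated in the legend, and the replicate intensities are invented for illustration.

```python
import math
from statistics import mean

def log_intensity_ratio(lower_group, higher_group):
    """Log of the ratio of mean replicate intensities (higher over lower)."""
    return math.log(mean(higher_group) / mean(lower_group))

# Hypothetical intensities of one spiked protein in three replicates per group.
low = [1.0e6, 1.2e6, 0.8e6]
high = [3.1e6, 2.9e6, 3.0e6]
ratio = log_intensity_ratio(low, high)
error = ratio - math.log(3)  # deviation from the true 3-fold spike-in ratio
```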

Distribution of peptide intensity ratios in the CPTAC Study 6 dataset. The vertical black line indicates the true spike-in ratio: log(3). The peptide intensity ratios were distributed around the true ratio. As protein abundance increased, the peptide ratio distributions converged more tightly on the true ratio. All UPS peptide ratios estimated by IDPQuantify were centered on the true spike-in ratio.

The true log fold changes in each comparison group are shown by blue horizontal lines. The rate of missing MS1 intensities is 12.1% for MaxQuant, 31.8% for SINQ, and 4.9% for IDPQuantify. The ratios estimated by IDPQuantify were shown in .