Improvements to the Percolator algorithm for peptide identification from shotgun proteomics data sets

Abstract:
Shotgun proteomics coupled with database search software allows the identification of a large
number of peptides in a single experiment. However, some existing search algorithms, such as
SEQUEST, use score functions that are designed primarily to identify the best peptide for a given
spectrum. Consequently, when comparing identifications across spectra, the SEQUEST score
function Xcorr fails to discriminate accurately between correct and incorrect peptide identifications.
Several machine learning methods have been proposed to address the resulting classification task of
distinguishing between correct and incorrect peptide-spectrum matches (PSMs). A recent example
is Percolator, which uses semi-supervised learning and a decoy database search strategy to learn to
distinguish between correct and incorrect PSMs identified by a database search algorithm. The
current work describes three improvements to Percolator. (1) Percolator’s heuristic optimization is
replaced with an explicit objective function whose choice is motivated intuitively. (2) Tractable
nonlinear models are used instead of linear models, leading to improved accuracy over the original
Percolator. (3) A method, Q-ranker, is proposed that directly optimizes the number of spectra
identified at a specified q value, which achieves further gains.
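
To make the quantity targeted in (3) concrete, the following minimal sketch counts the target PSMs accepted at a given q value threshold from a combined list of target and decoy scores. It is an illustration under stated assumptions, not the implementation used by Percolator or Q-ranker: the function names `q_values` and `count_accepted` are hypothetical, the false discovery rate is estimated with the simple ratio of decoys to targets above a score threshold, and ties in score are not treated specially.

```python
def q_values(scores, is_decoy):
    """Estimate one q value per PSM from target and decoy scores.

    Assumption: FDR at a score threshold is estimated as
    (#decoys above threshold) / (#targets above threshold).
    """
    # Rank PSMs from best to worst score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    # Running FDR estimate at each rank.
    fdrs = []
    targets = decoys = 0
    for i in order:
        if is_decoy[i]:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))

    # The q value of a PSM is the minimum FDR over all thresholds
    # that would still accept it, so sweep from the worst rank upward.
    qvals = [0.0] * len(scores)
    running_min = float("inf")
    for rank in range(len(order) - 1, -1, -1):
        running_min = min(running_min, fdrs[rank])
        qvals[order[rank]] = min(running_min, 1.0)
    return qvals


def count_accepted(scores, is_decoy, q_threshold):
    """Number of target PSMs whose estimated q value is <= q_threshold."""
    q = q_values(scores, is_decoy)
    return sum(1 for qi, d in zip(q, is_decoy) if not d and qi <= q_threshold)


if __name__ == "__main__":
    # Toy data (made up for illustration): higher score means a better match.
    scores = [10, 9, 8, 7, 6, 5]
    is_decoy = [False, False, True, False, False, True]
    print(count_accepted(scores, is_decoy, 0.01))
    print(count_accepted(scores, is_decoy, 0.25))
```

A scoring function that ranks more correct PSMs above the decoys raises this count at a fixed threshold, which is the sense in which the number of identifications at a specified q value can serve as an optimization target.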