
Translational Proteomics: Solving the Reproducibility Riddle

Data analysis in proteomics is not fit for purpose – here’s how we can get back on track.

David Chiang | 04/05/2018

Proteomics, with its unlimited potential for biomedicine, has so far fallen short. I believe the reason is simple: sophisticated big data is being processed by simplistic bioinformatics with underpowered computers. Novices are dazzled by thousands of proteins characterized at the push of a button. But experts find that it is mostly common proteins that are correctly identified, much of the quantitation is suspect, and – critically – it is hard to tell whether an identification is correct. How can we improve the utility of proteomics for identifying important low-abundance proteins? The trick is to borrow data analysis from numerical data mining in physics, not abstract statistics.

Let’s say we run a pneumonia sample to identify pathogens from proteins with a mass spectrometer. We process a gigabyte file with 50K raw spectra with a fast PC program that identifies and quantifies peptides and proteins from 20 percent of the spectra at 1 percent error. When analysis is so easy, who needs hypotheses or data understanding? We just need “better” software – defined as faster and cheaper and reporting more proteins. Of course, this assumes 1 percent error is enough, a self-estimated error is always robust, and quantity means quality – all of which are obviously incorrect.
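To put that error rate in perspective, here is a back-of-envelope calculation using only the figures quoted above – a rough sketch, not measured data:

```python
# Back-of-envelope arithmetic for the pneumonia scenario above.
# All numbers are the illustrative figures from the text, not real data.

total_spectra = 50_000
id_rate = 0.20   # 20 percent of spectra yield an identification
fdr = 0.01       # the self-estimated 1 percent error rate

identified = int(total_spectra * id_rate)   # peptide-spectrum matches
expected_false = int(identified * fdr)      # expected false identifications

print(identified, expected_false)  # prints: 10000 100
```

A hundred expected false identifications would be tolerable if they were spread evenly – but because abundant proteins are the easiest to match correctly, the errors tend to concentrate exactly where the interesting low-abundance biology is.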

As an analogy, imagine a novel space telescope with revolutionary accuracy, which eases data analysis; no cosmologist would acquire ad hoc imaging data and then shop for prototype software that identifies the most stars for publication, sight unseen. This unscientific approach would find thousands of bright stars but give irreproducible discoveries of faint ones. Content-rich physical data are heterogeneous, with varying signal-to-noise. Deep data require exponentially more computing to mathematically scrub.

Clinical research requires 100 percent accuracy for a few low-abundance proteins, not 99 percent including thousands of irrelevant abundant ones. It requires a precision paradigm centered on raw data, not probability models.

In conventional proteomics, data interpretation is outsourced to calculations few understand. Researchers choose a subjective search engine, rely on subjective probabilities to judge peptide IDs, depend on Bayesian inference to aggregate peptide IDs into protein identifications, and evaluate the quality of the results with a single error estimate.

A precise and rigorous abstraction requires three changes. First, simplify protein inference by representing each protein with its longest identified peptide (ideally long enough to be protein-unique). Second, filter peptide IDs using only physical mass data, not model-based parameters such as search scores. Finally, demote the search engine from its central role to that of an “educated guesser” proposing peptide ID hypotheses to be mass-filtered.
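The three changes can be sketched in a few lines of code. This is a minimal illustration, not a published implementation: the residue mass table is the standard monoisotopic one, but the 10 ppm tolerance and the function names are assumptions of mine.

```python
# Sketch: filter candidate peptide IDs by physical mass alone, then
# represent each protein by its longest surviving peptide.
# The 10 ppm tolerance is an illustrative assumption.

WATER = 18.010565  # monoisotopic mass of H2O, added once per intact peptide

MONO = {  # standard monoisotopic amino-acid residue masses (Da)
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}

def peptide_mass(seq):
    """Theoretical monoisotopic mass of an unmodified peptide."""
    return sum(MONO[aa] for aa in seq) + WATER

def mass_match(seq, observed_mass, tol_ppm=10.0):
    """Accept a peptide ID only if theory and measurement agree in mass."""
    theo = peptide_mass(seq)
    return abs(theo - observed_mass) / theo * 1e6 <= tol_ppm

def longest_peptide_per_protein(hypotheses):
    """hypotheses: (protein, peptide_seq, observed_mass) tuples from a
    search engine, treated here as guesses to be mass-filtered."""
    best = {}
    for protein, seq, obs in hypotheses:
        if not mass_match(seq, obs):
            continue  # rejected on physical mass, not on a search score
        if len(seq) > len(best.get(protein, "")):
            best[protein] = seq
    return best
```

The point of the sketch is the order of authority: the measured mass decides which hypotheses survive, and the search engine merely supplies candidates.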

For example, in infection research, we develop a hypothesis, acquire data, and then interpret data. The experimental goal is to identify and characterize at least one critical peptide from its noisy spectrum. Importantly, this analysis can be manually validated by an expert.

We may hypothesize a certain pathogen, design a data-independent acquisition (DIA) experiment to maximize the odds of finding certain protein-identifying peptides, then do perhaps a dozen runs to try to capture literally one-in-a-million spectra relevant to our hypothesis. Deep research is inherently a numbers game; new technologies just help increase the odds.
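A toy binomial calculation shows why a dozen runs is not overkill. The per-run spectrum count is an assumption carried over from the earlier 50K-spectra scenario, and independence between spectra is a simplification:

```python
# Toy model of the "numbers game": probability of capturing at least one
# relevant one-in-a-million spectrum across repeated runs. Assumes 50,000
# spectra per run (from the earlier scenario) and independent spectra.

p_relevant = 1e-6        # a relevant spectrum is literally one in a million
spectra_per_run = 50_000
runs = 12

p_miss_one_run = (1 - p_relevant) ** spectra_per_run
p_hit = 1 - p_miss_one_run ** runs

print(round(p_hit, 2))  # prints: 0.45
```

Even under these generous assumptions, a dozen runs gives less than even odds of capturing the spectrum of interest – which is exactly why deep research is a numbers game.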

In my view, the narrative that omics means hypothesis-free science is fundamentally flawed. The role of computers and artificial intelligence is to assist – not to replace – scientists who formulate hypotheses and interpret data.
