Abstract Information:

Using Fourier transform ion cyclotron resonance mass spectrometry (FTICR MS),
scientists are able to determine an unprecedented number of components in crude
oil. The statistical tools required to analyse the mass spectra struggle to keep pace
with advancing instrument capabilities and increasing quantities of data. Today, most
ultrahigh-resolution analyses of petroleum samples are based on very limited
numbers of mass spectra per sample. Because researchers often base findings on
single experiments with labour-intensive approaches, it can be challenging to monitor
repeatability and to differentiate between noise and true signals. As a result, mistakes
and false positive findings can be common. One difficulty is reliably distinguishing
genuine peaks from noise: when peaks are selected by signal-to-noise ratio alone,
genuine peaks are often removed if the threshold is set too high, and noise peaks
produce false positives if it is set too low.

False positive peaks typically appear in only a single mass spectrum, whereas
reliable peaks appear in multiple (if not all) replicate spectra. By combining information
across datasets, we can obtain more reliable information with a smaller margin of error.
We present Themis, a new algorithm developed in R that jointly pre-processes
replicate measurements of a complex sample. Joint pre-processing improves
consistency as a preliminary step to assigning chemical compositions, and the
algorithm includes a quality control criterion. By combining peak alignment with an
adaptive mixture model-based strategy, Themis distinguishes true peaks from noise.
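
The idea of using replicates to separate reproducible peaks from noise can be
illustrated with a minimal Python sketch. This is not the Themis implementation
(which is written in R and uses an adaptive mixture model): it is a simplified
stand-in that only aligns peaks within a ppm tolerance and keeps those seen in
several replicates. The function names, the 1 ppm tolerance, and the
minimum-replicate rule are all assumptions made for illustration.

```python
from typing import List, Tuple


def _flush(group: List[Tuple[float, int]], min_count: int,
           consensus: List[float]) -> None:
    """Keep a group if it contains peaks from at least min_count replicates."""
    n_replicates = len({idx for _, idx in group})
    if n_replicates >= min_count:
        # Represent the aligned group by its mean m/z
        consensus.append(sum(mz for mz, _ in group) / len(group))


def align_peaks(replicates: List[List[float]],
                tol_ppm: float = 1.0,      # hypothetical tolerance
                min_count: int = 2) -> List[float]:
    """Align m/z values across replicate peak lists and return the
    consensus m/z of peaks found in at least min_count replicates."""
    # Tag every m/z with its replicate index, then sort by m/z
    tagged = sorted((mz, idx)
                    for idx, peaks in enumerate(replicates)
                    for mz in peaks)
    consensus: List[float] = []
    group: List[Tuple[float, int]] = []
    for mz, idx in tagged:
        # Start a new group when the gap to the previous peak exceeds tol_ppm
        if group and (mz - group[-1][0]) / group[-1][0] * 1e6 > tol_ppm:
            _flush(group, min_count, consensus)
            group = []
        group.append((mz, idx))
    if group:
        _flush(group, min_count, consensus)
    return consensus


# Toy peak lists (made-up m/z values, not measured data)
toy_replicates = [
    [200.0000, 300.0000, 412.3456],
    [200.0001, 300.0002],
    [200.00005, 300.0001, 555.5555],
]
consensus_peaks = align_peaks(toy_replicates)
```

In this toy input, the peaks near m/z 200 and 300 recur in all three lists and
are retained, while the two peaks that occur only once are discarded. A
model-based approach such as the one described above would replace the fixed
replicate count with a statistically derived decision.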

We applied Themis to a variety of crude oil and naphthenic acid samples. The
results demonstrated more effective removal of noise-related peaks, together with
preservation and improvement of the chemical composition profile. Applied to the
NIST crude oil sample, Themis reduced the number of peaks from more than
16000 to 2260 but did not change the compositional assignment of the
high-intensity N1 class, and the root mean square (RMS) error improved from 0.24 ppm
to 0.22 ppm. The low-intensity NS class saw an improvement in its compositional
assignment, with well-distributed series, removal of isolated assignments, and a
reduction of the RMS error from 0.38 ppm to 0.21 ppm.
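
The abstract does not define how the RMS values are computed. A common
convention, assumed here, is the root mean square of the per-peak mass errors in
ppm between measured and theoretical m/z for the assigned compositions. The
short Python sketch below shows that calculation; the function names and the
example values are hypothetical.

```python
import math
from typing import Iterable, Tuple


def ppm_error(measured_mz: float, theoretical_mz: float) -> float:
    """Relative mass error in parts per million."""
    return (measured_mz - theoretical_mz) / theoretical_mz * 1e6


def rms_ppm(pairs: Iterable[Tuple[float, float]]) -> float:
    """RMS of ppm errors over (measured, theoretical) m/z pairs."""
    errors = [ppm_error(measured, theoretical) for measured, theoretical in pairs]
    if not errors:
        raise ValueError("at least one assigned peak is required")
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Made-up assignments: errors of roughly +1 ppm and -1 ppm
example_rms = rms_ppm([(400.0004, 400.0), (500.0, 500.0005)])
```

Because isolated, noise-driven assignments tend to carry larger mass errors,
removing them would be expected to lower this statistic, which is consistent
with the reductions reported above.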

Themis, therefore, affords greater success with the assignment of chemical
compositions to low-intensity peaks using petroleomics software. In addition,
improved monitoring of data quality and handling of replicate datasets will allow
researchers to process larger numbers of samples with greater
confidence. The algorithm will soon be made available for academic use via a web
server.