A probe-treatment-reference (PTR) model for the analysis of oligonucleotide expression microarrays.

Abstract

BACKGROUND:

Microarray pre-processing usually consists of normalization and summarization. Normalization aims to remove non-biological variation across arrays, and normalization algorithms generally require the specification of reference and target arrays. The issue of reference selection has not been fully addressed. Summarization aims to estimate transcript abundance from the normalized intensities. In this paper, we consider normalization and summarization jointly through a new strategy of reference selection.

RESULTS:

We propose a Probe-Treatment-Reference (PTR) model to streamline normalization and summarization by allowing multiple references. We estimate the model parameters by the Least Absolute Deviations (LAD) approach and implement the computation by median polishing. We show that the LAD estimator is robust in the sense that it has bounded influence in the three-factor PTR model. The model fitting implicitly defines an "optimal reference" for each probe-set. We evaluate the effectiveness of the PTR method on two Affymetrix spike-in data sets. Our method reduces the variation of non-differentially expressed genes and thereby increases the power to detect differentially expressed genes.
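The median-polishing idea can be illustrated with a minimal sketch. It assumes an additive three-factor model on the log scale, y[i, j, k] = overall + probe[i] + treatment[j] + reference[k] + residual[i, j, k], for a single probe-set; this form is an assumption made for illustration, and the function name `median_polish_3way` and its parameters are hypothetical rather than taken from the authors' implementation.

```python
import numpy as np


def median_polish_3way(y, max_iter=50, tol=1e-8):
    """Three-way median polish for an assumed additive model.

    y has shape (probes, treatments, references) and holds log-scale
    intensities of one probe-set. Returns the overall level, a list of
    three effect vectors (probe, treatment, reference) and the residuals.
    """
    resid = np.array(y, dtype=float)
    overall = 0.0
    effects = [np.zeros(n) for n in resid.shape]

    for _ in range(max_iter):
        change = 0.0
        for axis in range(3):
            # Median of the current residuals over the other two factors.
            other_axes = tuple(a for a in range(3) if a != axis)
            delta = np.median(resid, axis=other_axes)

            # Sweep that median out of the residuals and into the effect.
            shape = [1, 1, 1]
            shape[axis] = -1
            resid -= delta.reshape(shape)
            effects[axis] += delta

            # Keep each effect centred at median zero; the shift goes
            # into the overall level so the decomposition stays exact.
            centre = np.median(effects[axis])
            effects[axis] -= centre
            overall += centre

            change += np.abs(delta).sum()
        if change < tol:
            break

    return overall, effects, resid


# Toy example with made-up numbers: an exactly additive table.
probe = np.array([-1.0, 0.0, 2.0])
treatment = np.array([-0.5, 0.0, 0.5])
reference = np.array([-0.2, 0.0, 0.1])
table = (8.0 + probe[:, None, None]
         + treatment[None, :, None]
         + reference[None, None, :])
overall, (p_hat, t_hat, r_hat), resid = median_polish_3way(table)
```

In this sketch the treatment effects play the role of the summarized expression values, while the reference effects and residuals absorb differences between the reference arrays. Because every step uses medians, a single aberrant reference array has limited influence on the fitted treatment effects.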

CONCLUSION:

Our results indicate that the reference effect is important and should be considered in microarray pre-processing. The proposed PTR method is a general framework for dealing with the issue of reference selection and can readily be applied to existing normalization algorithms such as the invariant-set, sub-array and quantile methods.

The scheme of the PTR method. It comprises reference and target selection, multiple normalization, and three-factor model fitting for summarization. Only the cross strategy for reference and target selection is illustrated here.

M-A plots of the perturbed data set using different normalization methods and reference selections. Top (A1-A3): invariant-set; middle (B1-B3): quantile; bottom (C1-C3): sub-array. Left column (A1, B1 and C1): the reference is the perturbed array Exp03_R1*. Middle column (A2, B2 and C2): the reference in both A2 and C2 is Exp03_R2, while the reference in B2 is the pseudo-reference defined as the average quantiles of all six arrays. Right column (A3, B3 and C3): the result obtained by the PTR method using all six arrays as references. The grey dots are non-spike-in genes; the black dots are spike-in genes, which are expected to have log-ratio M = 1. The PTR results are not affected by the perturbed array Exp03_R1* and show the smallest variation for non-spike-in genes.
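For readers less familiar with M-A plots, the following sketch computes the two coordinates under the usual convention: M is the log2 ratio of two expression values and A is their average log2 intensity. The function name `ma_values` and the input numbers are illustrative assumptions, not values from the data sets described here.

```python
import numpy as np


def ma_values(x, y):
    """Return (M, A) for paired positive expression values x and y.

    M is the log2 ratio and A is the mean log2 intensity.
    """
    log_x = np.log2(np.asarray(x, dtype=float))
    log_y = np.log2(np.asarray(y, dtype=float))
    return log_x - log_y, (log_x + log_y) / 2.0


# A two-fold change gives M = 1; an unchanged gene gives M = 0.
M, A = ma_values([200.0, 50.0], [100.0, 50.0])
```

Under this convention, a spike-in gene with a two-fold concentration difference is expected at M = 1, while non-spike-in genes should scatter around M = 0.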

The LOESS curves of |M| versus A for various pre-processing methods. These plots compare the PTR method with other pre-processing methods in terms of the variation of non-spike-in genes. The PTR method gives the smallest variation for all three normalization algorithms.

Distribution of the reference effect. This box plot of the reference effect is obtained from the PTR model fitting on the perturbed data set after invariant-set normalization. The first reference array, Exp03_R1*, has been perturbed by adding noise and shows a markedly different distribution from the others.

The frequency of being the "implicit optimal reference". The plot shows how often each reference array served as the "implicit optimal reference" across all probe-sets. It is computed from the residual assessment after applying the PTR method with invariant-set normalization to the data set "Expt-3-4".