An automated procedure for the statistical analysis of two-dimensional electrophoresis gels for biomarkers discovery

Abstract

A protocol for the analysis of two-dimensional electrophoresis images has been written in R, an open-source environment for statistical computing, with the purpose of identifying and verifying potential disease biomarkers. In a stepwise procedure, spot volumes are normalized and missing values are replaced. A quality control procedure permits to efficiently discard gels that should not be included in a comparative statistics. Spots associated with confounding factors are identified and discarded from any further analysis. Comparisons between groups are performed with both parametric and non parametric tests, thus allowing researchers to choose the most appropriate condition. Hereafter, a classification model is built by linear discriminant analysis. The model is then verified so to obtain expected sensibility and specificity values.

Introduction

Proteomics is a fundamental tool for the discovery of disease biomarkers and to find new therapeutic targets (1). Originally proposed as the protein complement of the genome, the proteome recently assumed a more functional significance (2 3 4). In integration with genomics, transcriptomics, bioinformatics and biostatistics, proteomics became an indispensable instrument for translational aspects of modern molecular medicine (5). Proteomics is an intrinsically unbiased approach, aimed at generating a list of candidate proteins deserving further targeted studies (6). Such a global approach to biology and medicine allows to manage with the great complexity of the system being investigated, thus overcoming limits of conventional biochemistry or molecular biology tools (7).

In order to be unbiased, proteome should be approached globally. Despite the protein separation technique being employed (either gel-based or gel-free) (8 9 10), it is possible to observe protein levels spanning over three order of magnitude even if they cover almost six orders of magnitude in a cell and up to ten in plasma (11). Although there is a strong technological push towards gel-free techniques, 2-DE continues to be preferred for its capacity to resolve thousands of spots simultaneously and represents the only technique that can be routinely applied for parallel quantitative expression profiling of large sets of complex protein mixtures such as whole cell lysates (12 13). 2-DE gels are at the same time an image of protein distribution (in terms of molecular weight and pI) and levels, and a container of separated proteins available for further characterization 13. Nevertheless, there are intrinsic weaknesses in 2-DE. Due to the reduced dynamic range of the detection technique, less represented proteins are lost, thus limiting the global characteristic of the proteomic approach. Moreover, proteins of extreme hydrophobicity and those of extreme basic or acidic pI are not considered in the majority of published studies (14).

In this protocol we propose a suite of standardized computational procedures to analyze densitometry data from 2-DE gel image analysis. Briefly, data are internally normalized and quality-checked to ensure that the right comparative statistics procedure is applied. Confounding factors are identified in order to exclude spots significantly associated to them from any further analysis. Given the log-normal distribution of magnitude values, a non-parametric test is performed to identify a set of spots that best contribute to a discriminant function, the latter being identified by linear discriminant analysis. Furthermore, selected spots are ranked in terms of their contribution to the predictive model. Eventually, the efficiency of the set of spots is verified by power analysis, predictions are calculated by the leave-one-out cross validation and performance parameters are calculated in terms of receiver operating characteristics (ROC) analysis.

Equipment

2-DE gel images were acquired (12 bit grayscale) with the GelDoc-It Imaging System (UVP, Upland, CA, U.S.A.) and analyzed with ImageMaster 2D Platinum (GE Healthcare, Uppsala, Sweden). All procedures for data analysis and graphics were written in R, an open-source environment for statistical computing (15).

Procedure

1. Import spot volumes and clinical/demographic data

Spot volumes, exported from the ImageMaster software or other image analysis softwares in .csv (comma separated values) format, are read as a data matrix. Each row represents one of the L gels, whereas columns from 3 to N+2 are spot volumes for each of the N detected spots. Column 1 contains the gel labels, whereas column 2 contains the group factor (e.g., control vs. affected). A second matrix is read in .csv format, containing again one row per gel. Columns represent clinical and demographic data (age, gender, years from onset, disease staging, (any) drug daily dose, treated or not with a given drug, etc). Columns are read as independent vectors for further analysis.

2. Normalize 2-DE spot volumes

In each gel, a set of spots common to all gels (if L is small) or to 75% of gels (if L is large) is identified for normalization purposes. The sum of these spot volumes is considered as a loading/staining reference, thus providing L normalization factors. Each line in the spot volume matrix is then divided by the appropriate normalization factor to obtain a new matrix of normalized spot volumes.

3. Replace missing values

After normalization, missing spot values are replaced by the mean value of the spot volume of the group (i.e., control or affected) or, if the mean was lower than the 98th percentile, by the minimum value observed in the group (12). This ensures to safely prevent the artefactual assignment of missing spots (qualitative differences) as false negatives. To this purpose, the data matrix is split into separate submatrices for distinct groups, i.e., controls (co), early-onset patients (eo) and late-onset patients (lo).

4. Check the quality of data sets

Like many other biological observables, 2-DE spot volumes show a log-normal distribution, with less-intense spots more frequent than more-intense ones. To simply verify this, spot volumes are transformed logarithmically and the standard deviation is calculated within the group. Figure 1 shows almost constant values of standard deviations as a function of log(volume), as expected for a log-normal distribution. This means that, in principle, a parametric test is suitable for comparison between groups after logarithmic transformation of spot volumes (16).

Then, the biological variability of the subjects in each group is evaluated. Pairs of 2-DE gels of specimens from different subjects in the same group are compared by Pearson linear correlation of corresponding spot volumes after logarithmic transformation. If the number of specimens (i.e., L) is large, it is preferable to take a single subject per group as a reference, so that all gels in the group are compared to this one. For each comparison, the Pearson correlation coefficient is shown below the scatter graph. Dispersions of the studentized residuals against ranked magnitudes (QQ) plots are shown on the left of each correlation ( Figure 2 ). Spots deviating from linearity indicate non-normal distribution of residuals. A linear behavior indicates linear proportionality between gels. Dashed lines indicate 95% CI. On this basis, a direct proportionality between gels from different subjects may be assumed (16). Otherwise, the gel shall be discarded.

5. Identify spots associated to confounding factors

Confounding factors may be either grouping classes (gender, assuming or not a given treatment, familiarity) or real numbers (age, daily drug dose, age at onset, disease duration, disease staging). In the first case, a nonparametric Wilcoxon test is performed on spot volumes grouped according to the classes, so to determine which spots are significantly different in a class with respect to the other one ( Figure 3 ).

Otherwise, a linear correlation analysis is performed to determine which spots are significantly correlated with the magnitude of the confounding factor (e.g., daily drug dose or age). Important to notice, the analysis of correlation with age should be performed on control subjects only ( Figure 4 ). Similarly, in the drug dose correlation only patients should be included, treated or not with that drug.

Eventually, spots that significantly correlate with one or more confounding factor(s) are removed from the dataset.

6. Calculate p values for all comparisons

This is the most critical step of the procedure. If data are log-normally distributed, a parametric test may be applied after logarithmic transformation of the dataset. A Welch test is preferred, since distinct spots may have different variance. Otherwise, a non-parametric test such as the Wilcoxon test may be a safer choice (16). Working with three groups, five contrast are suggested, in principle. Three are binary contrast (one group vs. another one), a fourth compares patients (both eo and lo) vs. control subjects and the last contrast compares a peculiar group (e.g., eo) vs. all other subjects. In this way, ten p values are computed for each spot (five Welch test and five Wilcoxon test). If the purpose is to identify a biomarker able to discriminate patients from controls, then the threshold for significance should be 0.05/N, where N is the number of spots, and a non-parametric variant of the classic ANOVA test might be preferable (16). However, our procedure might be employed to identify all those spots that are different in the contrast, in order to build a classification function as a linear combination of all the selected spots or of a subset of them.

7. Build a linear discriminant function to classify the individuals and verify the predictive model

Predictive models for the classification of patients with respect to control subjects, or of patients with a particular disease subtype with respect to idiopathic disease patients (e.g., early-onset vs. late-onset) are built by linear discriminant analysis (LDA) (16). Volumes of spots selected as described above are linearly combined in a discriminating function, thus providing a likelihood score for the disease. The predictive model is verified by leave-one-out cross-validation. Each subject is iteratively removed from the dataset and classified on the basis of the spot volumes observed in the remaining individuals. This procedure permits to obtain reliable estimates of sensitivity and specificity. A disease likelihood score is attributed to each subject and a ROC curve (16) is calculated by plotting sensitivity and specificity as a function of the cutoff value.

The contribution of each spot may be of different relevance, depending on the separation of mean volume values and their LDA coefficient. The product of these two parameters indicates the intrinsic contribution of every spot to the discriminating model, so to permit the selective removal of worst-contributing spots. Eventually, a report table is produced to summarize the descriptive statistics results, the fold of change in log2 units, the LDA coefficient and the weighted contribution of each spot to the model (i.e., the product of the difference of mean values times the LDA coefficient) ( Table 1 ). On this basis, a subset of most relevant spots may be selected instead of the list including all identified spots.

8. Make a power analysis

Spots to be included in the classification function have been selected on the basis of a non-parametric test. However, linear discriminant analysis takes into account the separation of two normal distributions. It may be useful, at this point, to check which one would be the appropriate number of subjects to be enrolled in a verification study, assuming a normal distribution of each spot volume across the individuals.

Note: the complete source is available here as a text file.

Timing

The whole procedure usually requires less than five minutes to be completed, depending on the size of the dataset and on the CPU clock of the computer being used.

Troubleshooting

The appropriate choice of the comparative statistics test is a critical issue. As we discussed above, proteomics data – including 2-DE spot volumes – are expected to show a log-normal distribution. In this case, the researcher faces two possibilities: either data are transformed on a logarithmic scale and are analyzed by parametric tests, possibly considering a different variance for the compared groups, or data are compared on a linear scale after ranking, i.e. with a non-parametric test. Our procedure offers both possibilities, with a parametric Welch test or a non-parametric Wilcoxon test being performed on all spots. Related to this issue is the choice of the p value used as a threshold for statistical significance. When dealing with multiple comparisons, the p value should be corrected for the number of observables being observed (Bonferroni correction) (16). Nevertheless, it is up to the investigator to apply any correction of the p value. Actually, identified spots should not be considered per se as being significant in discriminating a patient group from a control group. Rather, a larger number of spots should be included in the LDA procedure to further refine them in a separate step.

LDA is a parametric approach, therefore it may not be effective in the classification of subjects based on multiple comparisons obtained with a non-parametric test. For this reason, an accurate refinement of the spots to be included in the model is advisable. We tackled this problem by progressively refining the list of candidate spots on the basis of their contribution to the model. When the reported procedure was applied (17), we started from a 20-spot model and progressively discarded the six spots showing the worst contribution (weight < 0.3) and the LDA was performed on the remaining 14 spots. In a further refinement, we discarded additional spots with weight < 1, thus obtaining a 9-spot model. A comparison of the performance of the three predictive models is shown in Table 2 (17).

Anticipated Results

A correct analysis of proteomic data is fundamental for the identification of candidate biomarkers. First, an adequate assessment of the gel quality grants a remarkable spare of time. Indeed, the inclusion of poorly resolved gels or bad quality gel images in the dataset leads to loss of significance of potentially interesting results. Any foreseeable correlation with confounding factors has to be taken into account, age and gender being just an example. Using this approach, we identified several spots that are correlated with dopaminergic therapies (18) and excluded them from the further analysis. Eventually, LDA allowed us to set up a discriminant function based on a linear combination of a set of independent observations, that dramatically improves sensitivity and specificity of the proposed diagnostic tool. We anticipate that this procedure might permit a rapid screening of 2-DE spots to be efficiently included in a panel of protein biomarkers for molecular diagnostics.

Acknowledgements

Figures

Values of standard deviations as a function of log(volume) are almost constant, as expected for a log-normal distribution. Open circles: control group. Filled circles: patients group.

Figure 2: Gel quality control

Left panel: scatter plot of a pair of gels belonging to the same group. The Pearson correlation coefficient is reported below, together with the slope of the straight line. Right panel: quantile-quantile plot for the residuals of the linear fit in panel A. A linear correlation between t quantiles and Studentized residuals indicates that gels are comparable.

Figure 3: Identification of spots associated with the gender

Box and whiskers plot for the volume of Spot 391 in females (green) and males (red).

Figure 4: Identification of spots correlated with the age

Linear correlation of the volume of Spot 513 vs. the age of control subjects.

Table 1: Summary of performance parameters for the three predictive models

A report table is produced to summarize the descriptive statistics results, the fold of change in log2 units, the LDA coefficient and the weighted contribution of each spot to the model (i.e., the product of the difference of mean values times the LDA coefficient).