Bottom Line:
The combination of ANOVA and redundancy exploitation allows for identification of biomarker candidates in multi-dimensional MALDI-TOF MS profiling studies with complex experimental design. With respect to feature selection, our method provides a fast and intuitive alternative to global optimization strategies with comparable performance. The method is implemented in R and the scripts are available by contacting the corresponding author.

Background: Diabetes, like many diseases and biological processes, is not mono-causal. On the one hand, multi-factorial studies with complex experimental design are required for its comprehensive analysis. On the other hand, the data from these studies often include a substantial amount of redundancy, for example because a protein is typically represented by a multitude of peptides. Coping simultaneously with both complexities (experimental and technological) makes data analysis a challenge for bioinformatics.

Results: We present a comprehensive workflow tailored for analyzing complex data, including data from multi-factorial studies. The approach aims at revealing effects caused by a distinct combination of experimental factors, in our case genotype and diet. Applying the workflow to an established polygenic mouse model for diet-induced type 2 diabetes, we found peptides with significant fold changes exclusively for the combination of a particular strain and diet. Exploitation of redundancy enables the visualization of peptide correlation and provides a natural way of selecting features for classification and prediction. Classification based on the features selected using our approach performs similarly to classification based on more complex feature selection methods.

Conclusions: The combination of ANOVA and redundancy exploitation allows for identification of biomarker candidates in multi-dimensional MALDI-TOF MS profiling studies with complex experimental design. With respect to feature selection, our method provides a fast and intuitive alternative to global optimization strategies with comparable performance. The method is implemented in R and the scripts are available by contacting the corresponding author.

Figure 4: Cluster Dendrogram. Cluster dendrogram of all peaks identified in this dataset (see the Methods section for details). Every node is characterized by four ANOVA p-values shown as a color-coded box with four fields: diet (upper left), genotype (upper right), time (lower right) and the combination of diet and genotype (lower left). The separate -log10 p-value color scales for the four factors are shown at the bottom. Three clusters discussed further in the text are marked with red circles.

Mentions:
In parallel to the ANOVA, average linkage clustering was performed. The cluster dendrogram combining correlated peptides and ANOVA p-values (see Figure 4) was calculated as described in the Methods section. The experimental factors differ in their impact on the data. The most significant p-values are obtained for genotype (as low as 10^-91), so the different mouse types can be easily distinguished using the profile data. Diet and the combination of genotype and diet have a much smaller but still substantial effect (p-values as low as 10^-14), whereas time has a stronger effect than diet (p-values as low as 10^-23). Nearly one third of all peaks, the whole right part of the dendrogram, is associated with the experimental factor time. On this global level, the dendrogram gives an intuitive overview of the complete data set because similarity and significance information are shown in a unified representation.
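The two ingredients of the dendrogram, a per-peak ANOVA statistic and average linkage clustering of correlated peaks, can be illustrated with a small sketch. This is not the authors' R implementation. It is a simplified Python illustration under several assumptions: a one-way F statistic stands in for the multi-factor ANOVA (no p-value is computed), the distance between peaks is assumed to be 1 minus the Pearson correlation, and the peak names and intensities are invented toy values.

```python
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two intensity profiles."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def one_way_f(groups):
    """F statistic of a one-way ANOVA (simplification of the multi-factor case).

    groups: list of lists, one list of peak intensities per factor level.
    """
    values = [v for g in groups for v in g]
    grand = mean(values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


def average_linkage(profiles, n_clusters):
    """Agglomerate peaks until n_clusters remain.

    profiles: dict mapping a peak name to its intensities across samples.
    Distance between two peaks is assumed to be 1 - Pearson r; the distance
    between two clusters is the mean of all pairwise peak distances.
    """
    names = list(profiles)
    dist = {
        frozenset((a, b)): 1 - pearson(profiles[a], profiles[b])
        for a, b in combinations(names, 2)
    }
    clusters = [[n] for n in names]

    def linkage(c1, c2):
        return mean(dist[frozenset((a, b))] for a in c1 for b in c2)

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters (i < j) and merge j into i.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical toy data: peak_a and peak_b rise together, peak_c does not.
toy_profiles = {
    "peak_a": [1.0, 2.0, 3.0, 4.0],
    "peak_b": [2.1, 3.9, 6.2, 8.0],
    "peak_c": [4.0, 1.0, 3.5, 1.5],
}
toy_clusters = average_linkage(toy_profiles, 2)

# Hypothetical intensities of one peak in two groups (e.g. two genotypes).
toy_f = one_way_f([[1, 2, 3], [4, 5, 6]])
```

In this toy example the two correlated peaks end up in one cluster, which mirrors how redundant peptides of the same protein would group in the dendrogram. Annotating each node with one statistic per factor, as in Figure 4, would require a multi-factor model and an F distribution for the p-values, both of which this sketch leaves out.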
