A More Efficient Approach to Biostatistical Data Analysis

By Kristen Amuzzini

Sept 15, 2005 | Traditional biostatistical analysis focused on identifying and understanding the functions of individual genes, proteins, and cells. Today, the emphasis has shifted to studying the organism as an integrated network of genes, proteins, and biochemical reactions.

The information needed to fully understand the functioning of an organism typically comes from a wide range of sources and instruments: DNA microarrays, for evaluating the expression of a large number of genes; mass spectrometers, for identifying larger, more complex proteins; nuclear magnetic resonance (NMR) spectrometers, for identifying smaller molecules such as metabolites; and Web-based molecular biology databases from organizations such as the National Center for Biotechnology Information (NCBI).

This multiplicity of information sources complicates the job of the biostatistician. Data obtained from the many sources used in biological research today typically come in incompatible formats. For example, most instrument manufacturers provide software that can be used only to interpret and analyze the data produced by their instruments. These software packages rarely support the latest data analysis algorithms. As a result, data from different sources must be analyzed separately in their native environment, a fragmented approach that makes it difficult to gain a systems-level understanding of an organism.

The increased diversity of data sources demands more flexible biostatistical analysis methods, as well as software tools that allow data on the interactions among genes, proteins, mechanisms, and the organism’s external environment to be integrated, analyzed, and visualized in a single environment.

One such approach uses statistical analysis and visualization tools in the MATLAB software environment, which make it easy to access data from a wide range of sources: sequence data in standard formats, such as FASTA and PDB; microarray data from Affymetrix, Agilent, and other platforms; and information from major Web-based resources, such as GenBank and NCBI BLAST. These tools support genomic and proteomic data formats and provide analysis techniques and specialized visualizations for sequence and microarray analysis.

With this approach, the biostatistician can take advantage of the strengths of many different instruments and data sources without investing time in manual data processing. For example, a single statistical analysis can be performed on a data set containing microarray, mass spectrometry, and NMR data.
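To make the idea of a single combined data set concrete, the following is a minimal sketch in plain Python rather than in the MATLAB tools the article describes. It assumes each source has already been reduced to per-sample measurements; the sample IDs, feature names, and the `combine_sources` helper are hypothetical and exist only for illustration.

```python
def combine_sources(sources):
    """Merge {source: {sample_id: {feature: value}}} into one row per sample.

    Only samples measured by every source are kept. Column names are
    prefixed with the source name so features from different instruments
    cannot collide.
    """
    if not sources:
        return {}
    shared = set.intersection(*(set(rows) for rows in sources.values()))
    table = {}
    for sample in sorted(shared):
        row = {}
        for name, rows in sources.items():
            for feature, value in rows[sample].items():
                row[f"{name}.{feature}"] = value
        table[sample] = row
    return table


# Hypothetical measurements from three instruments (illustrative values only).
sources = {
    "microarray": {"s1": {"geneA": 1.2}, "s2": {"geneA": 0.4}, "s3": {"geneA": 2.0}},
    "mass_spec": {"s1": {"peak_2000": 35.0}, "s2": {"peak_2000": 12.5}},
    "nmr": {"s1": {"lactate": 0.8}, "s2": {"lactate": 1.1}, "s3": {"lactate": 0.9}},
}

table = combine_sources(sources)
for sample, row in table.items():
    print(sample, row)
```

Keeping only the samples shared by all sources is one possible design choice; an analysis that tolerates missing values could keep every sample and leave gaps instead.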

Analyzing and Visualizing Microarray Data

The example discussed here uses microarray data from voxel A1 of the brain of a mouse in which a pharmacological model of Parkinson’s disease was induced using methamphetamine.*

The data are read into the MATLAB workspace, where a spatial plot of the microarray image is created, together with a field showing median pixel values in the various color channels. Spatial effects are readily apparent in the background intensities, so the data are normalized to remove this spatial bias. Next, scatter plots of the microarray data are generated to compare expression levels. Points above the diagonal in such a plot correspond to genes whose expression levels are higher in the A1 voxel than in the brain as a whole.
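The normalization and scatter-plot steps can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the MATLAB workflow itself: spatial bias is approximated by subtracting the median log-ratio of each print block, which presumes that most genes in a block are unchanged, and "above the diagonal" is taken to mean a normalized log2 ratio above a chosen threshold. The gene names, intensities, and block labels are made up.

```python
import math
from statistics import median


def log_ratios(voxel, reference):
    """Return log2(voxel / reference) for paired, positive intensities."""
    return [math.log2(v / r) for v, r in zip(voxel, reference)]


def center_by_block(ratios, blocks):
    """Subtract each block's median log-ratio to remove block-level bias."""
    by_block = {}
    for ratio, block in zip(ratios, blocks):
        by_block.setdefault(block, []).append(ratio)
    offsets = {block: median(values) for block, values in by_block.items()}
    return [ratio - offsets[block] for ratio, block in zip(ratios, blocks)]


def above_diagonal(genes, ratios, threshold=1.0):
    """Genes whose normalized log2 ratio exceeds the threshold (1.0 = 2-fold)."""
    return [gene for gene, ratio in zip(genes, ratios) if ratio > threshold]


# Made-up data: block 1 reads twice as bright as block 0 in the voxel channel.
genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
voxel = [100.0, 400.0, 200.0, 200.0, 1600.0, 400.0]
reference = [100.0, 100.0, 200.0, 100.0, 200.0, 200.0]
blocks = [0, 0, 0, 1, 1, 1]

normalized = center_by_block(log_ratios(voxel, reference), blocks)
up = above_diagonal(genes, normalized)
print(normalized)
print(up)
```

After centering, the two blocks become comparable, and the same two genes stand out in each regardless of the block-level brightness difference.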

These same analysis and visualization tools are used to process raw mass spectrometry data, which are stored in text files with two columns: the mass/charge (M/Z) values and the corresponding intensity values. Our example uses spectrograms taken from one of the low-resolution ovarian cancer NCI/FDA data sets. The spectra are generated using the WCX2 protein-binding chip. Resampling the mass spectrometry data homogenizes the M/Z vector, making it possible to compare different spectra against the same reference and at the same resolution.
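As a rough illustration of resampling, the sketch below parses two-column text and linearly interpolates each spectrum onto a shared, evenly spaced M/Z grid. It is written in plain Python with made-up values. Linear interpolation is a simplifying assumption, and a production tool would normally also filter the signal when reducing resolution.

```python
from bisect import bisect_right


def parse_spectrum(text):
    """Parse two-column text (M/Z, intensity) into two lists of floats."""
    mz, intensity = [], []
    for line in text.strip().splitlines():
        x, y = line.split()
        mz.append(float(x))
        intensity.append(float(y))
    return mz, intensity


def common_grid(start, stop, n):
    """Return n evenly spaced M/Z values from start to stop inclusive."""
    step = (stop - start) / (n - 1)
    return [start + i * step for i in range(n)]


def resample(mz, intensity, new_mz):
    """Linearly interpolate a spectrum onto new_mz (mz strictly increasing).

    Points outside the measured range take the nearest end value.
    """
    out = []
    for x in new_mz:
        if x <= mz[0]:
            out.append(intensity[0])
        elif x >= mz[-1]:
            out.append(intensity[-1])
        else:
            j = bisect_right(mz, x)  # mz[j - 1] <= x < mz[j]
            x0, x1 = mz[j - 1], mz[j]
            y0, y1 = intensity[j - 1], intensity[j]
            out.append(y0 + (y1 - y0) * (x - x0) / (x1 - x0))
    return out


# Two made-up spectra sampled at different M/Z positions.
mz_a, int_a = parse_spectrum("1000.0 0.0\n1002.0 4.0\n1004.0 0.0")
mz_b, int_b = parse_spectrum("1001.0 1.0\n1003.0 3.0\n1005.0 1.0")

grid = common_grid(1001.0, 1004.0, 4)
resampled_a = resample(mz_a, int_a, grid)
resampled_b = resample(mz_b, int_b, grid)
print(grid)
print(resampled_a)
print(resampled_b)
```

Once both spectra share the same M/Z vector, they can be compared point by point or stacked as rows of a matrix.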

Mass spectrometry data usually show a varying baseline, caused by chemical noise in the matrix or by ion overloading. An integrated data analysis and visualization environment makes it easy to estimate this low-frequency baseline, which is hidden among the high-frequency noise and signal peaks, and then subtract it from the spectrogram.
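One simple way to picture baseline correction is shown below in plain Python. The sketch assumes the baseline can be approximated by the minimum intensity in each non-overlapping window, joined by straight lines. That is a deliberately crude stand-in for the smoother estimators a real toolbox would use, and the window size and intensities are made up.

```python
def estimate_baseline(intensity, window):
    """Piecewise-linear baseline through the minimum of each window."""
    n = len(intensity)
    centers, minima = [], []
    for start in range(0, n, window):
        chunk = intensity[start:start + window]
        centers.append(start + (len(chunk) - 1) / 2)
        minima.append(min(chunk))
    baseline = []
    for i in range(n):
        if i <= centers[0]:
            baseline.append(float(minima[0]))
        elif i >= centers[-1]:
            baseline.append(float(minima[-1]))
        else:
            # First window center at or beyond i; interpolate from the previous one.
            k = next(j for j in range(1, len(centers)) if centers[j] >= i)
            t = (i - centers[k - 1]) / (centers[k] - centers[k - 1])
            baseline.append(minima[k - 1] + t * (minima[k] - minima[k - 1]))
    return baseline


def subtract_baseline(intensity, window):
    """Remove the estimated baseline, clipping negative values to zero."""
    baseline = estimate_baseline(intensity, window)
    return [max(y - b, 0.0) for y, b in zip(intensity, baseline)]


# Made-up spectrum: a slowly falling baseline with one peak at index 3.
spectrum = [10.0, 9.0, 8.0, 12.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0]
baseline = estimate_baseline(spectrum, window=5)
corrected = subtract_baseline(spectrum, window=5)
print(baseline)
print(corrected)
```

Because a windowed minimum sits at or below the true baseline on a slope, some residual offset remains; the peak nonetheless stands out more clearly against the corrected background.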

By the same token, mass spectrometers that have been calibrated differently exhibit variations in the relation between the observed M/Z vector and the true time of flight of the ions. As a result, systematic shifts can be observed in repeated experiments. Misaligned spectrograms can be corrected by providing a set of M/Z values where reference peaks are expected to appear. A heat map is used to observe the alignment of the spectra before and after applying the alignment algorithm. In these experiments, systematic differences are also observed in the total amount of desorbed and ionized proteins. To compensate for this, the relative intensities of the spectrograms are normalized. After preprocessing, the data are ready for biomarker detection, which can be performed with the same analysis tools.
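The alignment and normalization steps can be illustrated with a short plain-Python sketch. Two simplifying assumptions apply: misalignment is modeled as a single constant M/Z offset, estimated from the strongest point near each expected reference peak, and intensities are normalized by dividing by the area under the curve. The article does not specify the algorithms used, so the peak positions, tolerance, and helper names here are hypothetical.

```python
def find_peak_near(mz, intensity, target, tolerance):
    """M/Z of the highest point within +/- tolerance of target, or None."""
    candidates = [(y, x) for x, y in zip(mz, intensity) if abs(x - target) <= tolerance]
    return max(candidates)[1] if candidates else None


def estimate_shift(mz, intensity, reference_peaks, tolerance):
    """Mean offset (expected minus observed) over the reference peaks found."""
    offsets = []
    for ref in reference_peaks:
        found = find_peak_near(mz, intensity, ref, tolerance)
        if found is not None:
            offsets.append(ref - found)
    return sum(offsets) / len(offsets) if offsets else 0.0


def align(mz, intensity, reference_peaks, tolerance):
    """Shift the M/Z vector so observed peaks land on the reference peaks."""
    shift = estimate_shift(mz, intensity, reference_peaks, tolerance)
    return [x + shift for x in mz]


def area_under_curve(mz, intensity):
    """Trapezoidal area under the spectrum."""
    return sum(
        (mz[i + 1] - mz[i]) * (intensity[i] + intensity[i + 1]) / 2
        for i in range(len(mz) - 1)
    )


def normalize_area(mz, intensity):
    """Scale intensities so the area under the curve equals 1."""
    area = area_under_curve(mz, intensity)
    if area <= 0:
        raise ValueError("spectrum has no positive area to normalize")
    return [y / area for y in intensity]


# Made-up spectrum whose peaks sit one M/Z unit below the expected positions.
mz = [1000.0 + i for i in range(10)]
intensity = [1.0, 1.0, 10.0, 1.0, 1.0, 1.0, 1.0, 8.0, 1.0, 1.0]
reference_peaks = [1003.0, 1008.0]

shift = estimate_shift(mz, intensity, reference_peaks, tolerance=2.0)
aligned_mz = align(mz, intensity, reference_peaks, tolerance=2.0)
normalized = normalize_area(aligned_mz, intensity)

# A second run with twice the total signal normalizes to the same profile.
doubled = normalize_area(aligned_mz, [2 * y for y in intensity])
print(shift)
print(aligned_mz[2], aligned_mz[7])
```

After these two steps, spectra from differently calibrated runs share peak positions and a common intensity scale, which is what a side-by-side heat map or a biomarker search needs.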

Using this efficient, integrated approach, data from many different sources can easily be acquired, massaged to ensure their integrity, and then combined into single tables that biostatisticians can use to search for patterns regardless of the source of the data. The completed statistical analysis application can be deployed to researchers as an Excel spreadsheet or as a standalone executable with a graphical user interface.

Kristen Amuzzini is the biotech and pharmaceutical industry marketing manager for The MathWorks, where she specializes in computational biology. Kristen spearheads an effort to foster industry and academic adoption of MATLAB-based tools for biological data analysis. She would like to thank Sam Roberts and Rob Henson of The MathWorks for their contributions to this article. E-mail: kristen.amuzzini@mathworks.com.