ACADEMIC & RESEARCH INTEREST

Statistical Methods of Analyzing Gene Expression Data

Expression data are generated by hybridizing transcripts from tissues under controlled conditions to microarrays or gene chips. If one gene regulates (up or down) another gene, or both are involved in a biochemical pathway, their expression profiles over time are expected to correlate. Expression data are often analyzed using clustering procedures: clusters represent sets of genes displaying coordinately regulated expression profiles. Because expression data contain significant amounts of random variation, and because clusters depend on the procedure applied, assigning confidence measures to clusters is useful. Specifically, we have implemented an algorithm in the statistical programming language R that assigns confidence measures to groupings of genes obtained by clustering routines. Permutation testing and convex hull methods are used to simulate pseudo-random gene expression data sets, and statistics obtained from these randomly generated sets provide a basis for comparison to the original data.
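
The permutation idea can be illustrated with a minimal sketch. It is written in Python for brevity rather than in R, and it is not the algorithm described above: it assumes a cluster is scored by its mean pairwise Pearson correlation, builds the null distribution by shuffling the time points of each profile independently, and omits the convex hull step entirely. The function names and the example profiles are hypothetical.

```python
import random
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def cluster_tightness(profiles):
    """Assumed cluster statistic: mean correlation over all pairs of profiles."""
    return mean(pearson(a, b) for a, b in combinations(profiles, 2))


def permutation_p_value(profiles, n_perm=1000, seed=0):
    """Fraction of shuffled data sets at least as tight as the observed cluster."""
    rng = random.Random(seed)
    observed = cluster_tightness(profiles)
    hits = 0
    for _ in range(n_perm):
        # Shuffle each profile's time points independently to break co-regulation.
        shuffled = [rng.sample(p, len(p)) for p in profiles]
        if cluster_tightness(shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical cluster of three genes with similar time courses.
cluster = [
    [1.0, 2.1, 3.9, 8.2, 16.5, 31.0],
    [0.9, 1.8, 4.2, 7.9, 15.8, 33.0],
    [1.2, 2.0, 4.1, 8.5, 17.0, 30.1],
]
p = permutation_p_value(cluster, n_perm=500, seed=1)
```

A small value of `p` suggests that the cluster is tighter than would be expected from randomly rearranged profiles; a value near 1 suggests the grouping could be an artifact of the clustering procedure.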

My contribution to the GeneX OpenSource gene expression database and software system [http://sourceforge.net/projects/genex] consists of several gene expression normalization and analysis programs, two of which present novel approaches to clustering. These methods are being generalized for application to microarray data generated on different technology platforms (Affymetrix, NimbleGen™, and custom two-color cDNA arrays). Enhancements are being made to include metrics that provide the researcher with (more) biologically meaningful results.

Experimental Design and Normalization Methods

As genetic data continue to accumulate rapidly, there is a need for analysis methods that can assess experiments while they are in progress. Properties of the experimental design, which provide control over and understanding of the sources of variation in both signal and noise, affect how data should be analyzed and how appropriate models are constructed. The ultimate aim of any gene expression data analyst is to be involved in the experimental design of the microarray. Too often, experiments are placed in the analyst's hands without proper design. Poorly designed experiments most often yield meaningless results, and always increase the effort (and creativity) required of the analyst. I am currently developing several experimental designs for plant and human array experiments, with several sets of both positive and negative controls, and am assessing their performance within different experiments.

Graph-theoretic Modeling of Temporal Gene Expression Data

The analysis of large amounts of microarray data is a significant challenge for the researcher. The parallel assay of thousands of data points, not all of which are independent, across a number of temporal states, provides an interesting platform for statistical analyses and the construction of models. Identifying clusters within temporal gene expression profiles is equivalent to finding patterns in time series data. Although standard hierarchical clustering techniques can be applied to this type of data, no standard tools exist to identify such patterns. I have developed a graph-theoretic approach for constructing putative functional network models that suggest hypotheses about the functions of unknown genes. This technique has been applied to several experiments of Dr. John Cushman at the University of Nevada, Reno, with promising results. Specifically, the experiments measure expression levels in the common ice plant, Mesembryanthemum crystallinum, under abiotic stress. Ice plant is a facultative halophyte, which can shift from C3 to Crassulacean acid metabolism (CAM) photosynthesis in response to environmental stress conditions such as water stress or hypersalinity. A long-term goal is to understand the complex adaptive mechanisms of this plant and deploy them in agriculturally important crops to improve drought and salinity tolerance. An innovative distance metric is under development to measure the similarity between any pair of genes in a more biologically grounded manner than commonly used distance metrics. Using these similarity relations, a bi-directional graph is generated by connecting genes based on their degree of similarity. From this graph one can detect "clusters" within the structure of the graph’s connectivity. These clusters provide hypotheses of gene function and interaction, and guide the association of genes with the biochemical pathway changes involved in stress responses and adaptive mechanisms of the organism under study. An ongoing study also focuses on post-analysis findings and the biological meaning behind clusters, an often-neglected step in microarray analysis.
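
The graph construction can be sketched as follows, in Python for brevity. Because the distance metric mentioned above is still under development and is not specified here, the sketch substitutes plain Pearson correlation as the similarity measure, treats edges as symmetric, and reads clusters off as connected components. The threshold value, gene names, and profiles are all hypothetical.

```python
from collections import deque
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def similarity_graph(profiles, threshold=0.9):
    """Connect two genes when their profile similarity reaches the threshold."""
    genes = sorted(profiles)
    graph = {g: set() for g in genes}
    for i, a in enumerate(genes):
        for b in genes[i + 1:]:
            if pearson(profiles[a], profiles[b]) >= threshold:
                graph[a].add(b)
                graph[b].add(a)
    return graph


def connected_components(graph):
    """Return clusters as sorted lists of genes, found by breadth-first search."""
    seen, components = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbor in sorted(graph[node]):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(sorted(component))
    return components


# Hypothetical time courses for five genes.
profiles = {
    "geneA": [1, 2, 3, 4, 5],
    "geneB": [2, 4, 6, 8, 11],
    "geneC": [5, 4, 3, 2, 1],
    "geneD": [9, 8, 6, 4, 3],
    "geneE": [1, 5, 1, 5, 1],
}
clusters = connected_components(similarity_graph(profiles, threshold=0.9))
# clusters == [["geneA", "geneB"], ["geneC", "geneD"], ["geneE"]]
```

Connected components are the simplest notion of a cluster in a graph; denser structures such as cliques could be substituted without changing the construction of the graph itself.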

Modeling Gene Interactions with Combinatorial Methods

Complex networks are often used to model hierarchical social, biological or communication systems, as well as genetic systems. As a first approximation, Boolean networks are often used. In my research at the Virginia Bioinformatics Institute with Professor Reinhard Laubenbacher, we developed a method of encoding a Boolean network as a collection of simplicial complexes. We also established a combinatorial analogue of the homotopy theory of topological spaces to analyze these simplicial complexes. The resulting combinatorial invariants provide information on the dynamics of the network. By representing genetic relationships via (Boolean) network structures, applications of combinatorial homotopy theory may reveal overall network behavior and patterns of influence within and across gene subgroups.
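
As a small illustration of what is meant by the dynamics of a Boolean network, the Python sketch below enumerates the attractors of a three-gene network under synchronous updates. It does not implement the simplicial complex encoding or the combinatorial homotopy invariants described above. The update rules (a negative feedback ring) are hypothetical and chosen only for illustration.

```python
from itertools import product


def step(state):
    """Synchronous update of a hypothetical three-gene network.

    Assumed rules: A is repressed by C, B is activated by A, C is activated by B.
    """
    a, b, c = state
    return (1 - c, a, b)


def find_attractor(state):
    """Follow the trajectory from `state` until it repeats; return the cycle reached."""
    seen, path = {}, []
    while state not in seen:
        seen[state] = len(path)
        path.append(state)
        state = step(state)
    cycle = path[seen[state]:]
    # Rotate so the cycle starts at its smallest state, giving a canonical form.
    k = cycle.index(min(cycle))
    return tuple(cycle[k:] + cycle[:k])


# Explore all 2**3 initial states and collect the distinct attractors.
attractors = {find_attractor(s) for s in product((0, 1), repeat=3)}
attractor_lengths = sorted(len(cycle) for cycle in attractors)
# attractor_lengths == [2, 6]
```

For this particular rule set every state lies on one of two cycles, of lengths 2 and 6. Exhaustive enumeration of this kind is only feasible for small networks, which is one motivation for invariants that summarize the dynamics without simulating every state.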

Visualization of Microarray Gene Expression Data

An artificial heatmap of the intensity levels of a 2-color cDNA microarray is generated for each channel, and for the background-corrected ratio values. This image allows the user to quickly determine whether any spatial variation appears on the array, or whether control spots are behaving as predicted. Similarly, the tool is applicable to high-density oligonucleotide arrays, such as those made by Affymetrix and NimbleGen™. This technique provides the researcher with a bird's-eye view of each array in the experiment. The software is written in the R programming language and is simple to use.
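
The data preparation behind such an image can be sketched in a few lines. The example is in Python rather than R and is not the tool described above: it assumes each spot record carries its grid position plus foreground and background intensities for the red and green channels, and it renders the grid as text symbols instead of colors. The field names, the intensity floor, and the cutoff are hypothetical.

```python
import math


def log_ratio_grid(spots, n_rows, n_cols, floor=1.0):
    """Place background-corrected log2(red/green) ratios at their array positions."""
    grid = [[None] * n_cols for _ in range(n_rows)]
    for s in spots:
        # Subtract local background; clamp at `floor` so the logarithm is defined.
        red = max(s["red_fg"] - s["red_bg"], floor)
        green = max(s["green_fg"] - s["green_bg"], floor)
        grid[s["row"]][s["col"]] = math.log2(red / green)
    return grid


def render(grid, cutoff=1.0):
    """Text stand-in for a heatmap: '+' high, '-' low, '.' neutral, '?' missing."""
    lines = []
    for row in grid:
        line = ""
        for value in row:
            if value is None:
                line += "?"
            elif value >= cutoff:
                line += "+"
            elif value <= -cutoff:
                line += "-"
            else:
                line += "."
        lines.append(line)
    return "\n".join(lines)


# Hypothetical 2 x 3 array with one missing spot at row 1, column 1.
spots = [
    {"row": 0, "col": 0, "red_fg": 1100, "red_bg": 100, "green_fg": 600, "green_bg": 100},
    {"row": 0, "col": 1, "red_fg": 500, "red_bg": 100, "green_fg": 500, "green_bg": 100},
    {"row": 0, "col": 2, "red_fg": 300, "red_bg": 100, "green_fg": 900, "green_bg": 100},
    {"row": 1, "col": 0, "red_fg": 90, "red_bg": 100, "green_fg": 101, "green_bg": 100},
    {"row": 1, "col": 2, "red_fg": 2100, "red_bg": 100, "green_fg": 350, "green_bg": 100},
]
grid = log_ratio_grid(spots, n_rows=2, n_cols=3)
picture = render(grid)
```

Laying values out by physical position is what makes spatial artifacts visible: a streak or corner of uniformly high or low ratios stands out in the grid even when the same values look unremarkable in a histogram.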

Visualization of Haplotype Sharing and Fine Mapping using SNP Data

For the analysis of data stemming from our high-throughput genotyping experiments, we have developed a tool, HapMapper, that automates the selection of SNPs for fine-mapping genetic associations. The tool generates a graph of genotypes from phased chromosomes that are grouped by haplotype via a hierarchical clustering approach to display long-range linkage disequilibrium patterns for a given allele of interest. We are currently using phased chromosome data from the HapMap project and, among other things, highlight those SNPs included on the Affymetrix 100K SNP GeneChip. These graphs make it possible to identify the haplotypes on which an associated SNP occurs and to identify the region likely to contain the causative variant for a given association.

A separate module within HapMapper identifies SNPs that serve to distinguish haplotypes, as well as those in strong linkage disequilibrium with an associated allele, and those that are proxies for other SNPs in the region. These data are integrated into the visual display, aiding in the selection of SNPs for fine mapping haplotypes that contain the associated allele. The software is written in R and has been implemented for our use in fine-mapping several regions of interest.
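
The notion of a proxy SNP can be made concrete with the standard r-squared measure of linkage disequilibrium computed from phased haplotypes. The Python sketch below is not the HapMapper module: it assumes biallelic SNPs coded 0/1, one row per phased chromosome, and an arbitrary r-squared threshold for calling a proxy. The SNP identifiers and haplotypes are hypothetical.

```python
def r_squared(x, y):
    """LD r^2 between two biallelic SNPs given 0/1 alleles on phased chromosomes."""
    n = len(x)
    p_a = sum(x) / n
    p_b = sum(y) / n
    p_ab = sum(1 for a, b in zip(x, y) if a == 1 and b == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    d = p_ab - p_a * p_b
    return d * d / denom


def find_proxies(haplotypes, snp_ids, target, threshold=0.8):
    """Return SNPs whose r^2 with the target SNP reaches the threshold."""
    columns = {snp: [row[i] for row in haplotypes] for i, snp in enumerate(snp_ids)}
    return [
        snp
        for snp in snp_ids
        if snp != target and r_squared(columns[target], columns[snp]) >= threshold
    ]


# Hypothetical phased data: 8 chromosomes (rows) by 4 SNPs (columns).
snp_ids = ["snp_a", "snp_b", "snp_c", "snp_d"]
haplotypes = [
    [1, 1, 0, 1],
    [1, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
]
# The threshold of 0.5 is arbitrary and chosen to suit this toy data set.
proxies = find_proxies(haplotypes, snp_ids, target="snp_a", threshold=0.5)
# proxies == ["snp_b"]
```

In this toy data set only `snp_b` tracks the associated SNP closely enough to act as a proxy; the other two SNPs have r-squared values of 0.25 with `snp_a`.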