Research Interests: My research interests are in model selection, the theory and geometry of mixture models, and functional data analysis. I am especially interested in challenges presented by "large magnitude", both in the dimension of the data vectors and in the number of vectors. Core areas of methodological research include multivariate mixtures, structural equation models, high-dimensional clustering, and functional clustering. Key collaborative activities involve projects in immunology, modeling of climate-ecosystem dynamics, and medical image segmentation.

Background:
In recent years, intense research efforts have focused on developing methods for automated flow cytometric data analysis. However, in designing such applications, little or no attention has been paid to the human perspective that is central to the manual gating process of identifying and characterizing cell populations. In particular, the assumption of many common techniques that cell populations can be modeled reliably with pre-specified distributions may not hold true in real-life samples, which can have populations of arbitrary shapes and considerable inter-sample variation.
Results: To address this, we developed a new framework, flowScape, for emulating certain key aspects of the human perspective in analyzing flow data, which we implemented in multiple steps. First, flowScape creates a mathematically rigorous map of the high-dimensional flow data landscape based on dense and sparse regions defined by relative concentrations of events around modes. In the second step, these modal clusters are connected with a global hierarchical structure. This representation allows flowScape to perform ridgeline analysis both for traversing the landscape and for isolating cell populations at different levels of resolution. Finally, we extended manual gating with a new capacity for constructing templates that can identify target populations in terms of their relative parameters, as opposed to the more commonly used absolute or physical parameters. This allows flowScape to apply such templates in batch mode for detecting the corresponding populations in a flexible, sample-specific manner. We also demonstrate different applications of our framework to flow data analysis and show its superiority over other analytical methods.
Conclusions: The human perspective, built on top of intuition and experience, is a very important component of flow cytometric data analysis. By emulating some of its approaches and extending these with automation and rigor, flowScape provides a flexible and robust framework for computational cytomics.

We present a new approach to factor rotation for functional data.
This rotation is achieved by rotating the functional principal components
towards a pre-defined space of periodic functions designed
to decompose the total variation into components that are nearly-periodic
and nearly-aperiodic with a pre-defined period. We show
that the factor rotation can be obtained by calculation of canonical
correlations between appropriate spaces which makes the methodology
computationally efficient. Moreover we demonstrate that our
proposed rotations provide stable and interpretable results in the
presence of highly complex covariance. This work is motivated by
the goal of finding interpretable sources of variability in a vegetation
index obtained from remote sensing instruments, and we demonstrate
our methodology through an application of factor rotation to these data.

Bayes Factors play an important role in comparing the fit of models ranging from multiple regression
to mixture models. Full Bayesian analysis calculates a Bayes Factor from an explicit
prior distribution. However, computational limitations or lack of an appropriate prior sometimes
prevent researchers from using an exact Bayes Factor. Instead, it is approximated, often
using Schwarz’s (1978) Bayesian Information Criterion (BIC), or a variant of the BIC. In this
paper we provide a comparison of several Bayes Factor approximations, including two new
approximations, the SPBIC and IBIC. The SPBIC is justified by using a scaled unit information
prior distribution that is more general than the BIC’s unit information prior, and the IBIC
approximation utilizes more terms of approximation than in the BIC. In a simulation study we
show that several measures perform well in large samples, that performance declines in smaller
samples, and that SPBIC and IBIC can improve on existing measures under some
conditions, including small sample sizes. We then illustrate the use of the fit measures in an
empirical example from the crime data of Ehrlich (1973). We conclude with recommendations
for researchers.
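The BIC-style approximations above all trade an explicit prior for a penalized log-likelihood. As a minimal illustration (not the SPBIC or IBIC formulas themselves, which are specific to the paper), the sketch below compares an intercept-only model with a simple linear regression using the standard Gaussian-likelihood form BIC = n log(RSS/n) + k log n; the data are hypothetical.

```python
import math, random

def bic_gaussian(rss, n, k):
    """BIC for a Gaussian regression model: n*log(RSS/n) + k*log(n)."""
    return n * math.log(rss / n) + k * math.log(n)

random.seed(0)
n = 100
x = [i / n for i in range(n)]
y = [2.0 * xi + random.gauss(0, 0.1) for xi in x]  # true model is linear

# Model 1: intercept only.  RSS around the mean, k = 1 parameter.
ybar = sum(y) / n
rss0 = sum((yi - ybar) ** 2 for yi in y)

# Model 2: simple linear regression, closed-form least squares, k = 2.
xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b = sxy / sxx
a = ybar - b * xbar
rss1 = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

bic0 = bic_gaussian(rss0, n, 1)
bic1 = bic_gaussian(rss1, n, 2)
# The true (linear) model attains the lower BIC, i.e. the Bayes Factor
# approximation favors it despite the extra parameter.
```

The lower-BIC model is the one the approximate Bayes Factor prefers; SPBIC and IBIC modify the penalty term, not this basic comparison logic.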

The main result of this article states that one can get as many as D + 1 modes from a
two-component normal mixture in D dimensions. Multivariate mixture models are widely used
for modeling heterogeneous populations and for cluster analysis. Either the components directly,
or modes arising from these components, are often used to extract individual clusters. Though
these strategies work well in lower dimensions, our results show that high-dimensional mixtures are
often very complex, and researchers should take extra precautions when using mixtures for cluster
analysis. Even in the simplest case of mixing only two normal components in D dimensions, we
show that the mixture can have a maximum of D + 1 modes. When we mix more components, or if
the components are non-normal, the number of modes might be even higher, which might lead
to wrong inference on the number of clusters. Further analyses show that the number of modes
depends on the component means and on the eigenvalues of the ratio of the two component covariance
matrices, which in turn provides a clear guideline as to when one can use mixture analysis for
clustering high-dimensional data.
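One way to make the mode count concrete: for a two-component normal mixture, all critical points lie on the one-dimensional ridgeline curve of Ray and Lindsay (2005), so counting local maxima of the density along that curve counts the modes. The sketch below assumes diagonal covariance matrices (so the ridgeline has a componentwise form) and uses a simple grid search; the example mixtures are hypothetical.

```python
import math

def normal_pdf_diag(x, mu, var):
    """Density of a normal with diagonal covariance (var = list of variances)."""
    return math.prod(
        math.exp(-(xi - mi) ** 2 / (2 * vi)) / math.sqrt(2 * math.pi * vi)
        for xi, mi, vi in zip(x, mu, var))

def ridgeline_point(alpha, mu1, var1, mu2, var2):
    """Ray-Lindsay ridgeline for diagonal covariances (componentwise form)."""
    return [((1 - alpha) * m1 / v1 + alpha * m2 / v2) /
            ((1 - alpha) / v1 + alpha / v2)
            for m1, v1, m2, v2 in zip(mu1, var1, mu2, var2)]

def count_mixture_modes(pi1, mu1, var1, mu2, var2, grid=2001):
    """Count local maxima of the mixture density along the ridgeline."""
    dens = []
    for i in range(grid):
        a = i / (grid - 1)
        x = ridgeline_point(a, mu1, var1, mu2, var2)
        dens.append(pi1 * normal_pdf_diag(x, mu1, var1) +
                    (1 - pi1) * normal_pdf_diag(x, mu2, var2))
    modes = 0
    for i in range(grid):
        left = dens[i - 1] if i > 0 else -1.0
        right = dens[i + 1] if i < grid - 1 else -1.0
        if dens[i] > left and dens[i] > right:
            modes += 1
    return modes

# Two well-separated components in 2 dimensions -> two modes.
two = count_mixture_modes(0.5, [0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [1.0, 1.0])
# Nearly coincident components -> the mixture is unimodal.
one = count_mixture_modes(0.5, [0.0, 0.0], [1.0, 1.0], [0.5, 0.5], [1.0, 1.0])
```

Unequal covariances (where the extra modes beyond two arise) only change the ridgeline, not the counting procedure.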

Modalclust is an R package that performs Hierarchical Mode Association Clustering (HMAC)
along with its parallel implementation (PHMAC) over several processors. Modal clustering
techniques are especially designed to efficiently extract clusters in high dimensions with arbitrary density
shapes. Further, clustering is performed over several resolutions, and the results are summarized
as a hierarchical tree, thus providing a model-based multi-resolution cluster analysis. Finally,
we implement a novel parallel version of HMAC which distributes the clustering job over
several processors, thereby dramatically increasing the speed of the clustering procedure, especially for
large data sets. The package also provides a number of functions for visualizing clusters in high
dimensions, which can also be used with other clustering software.

Background: The widely used k top scoring pair (k-TSP) algorithm is a simple yet
powerful parameter-free classifier. It owes its success in many cancer microarray datasets
to an effective feature selection algorithm that is based on relative expression ordering of
gene pairs. However, its general robustness does not extend to some difficult datasets,
such as those involving cancer outcome prediction, which may be due to the relatively
simple voting scheme used by the classifier. We believe that the performance can be
enhanced by separating its effective feature selection component and combining it with a
powerful classifier such as the support vector machine (SVM). More generally, the
top-scoring pairs generated by the k-TSP ranking algorithm can be used as a dimensionally
reduced subspace for other machine learning classifiers.

Results: We developed an approach integrating the k-TSP ranking algorithm (TSP) with
other machine learning methods, allowing combination of the computationally efficient,
multivariate feature ranking of k-TSP with multivariate classifiers such as SVM. We
evaluated this hybrid scheme (k-TSP+SVM) in a range of simulated datasets with known
data structures. As compared with other feature selection methods, such as a univariate
method similar to Fisher’s discriminant criterion (Fisher), or a recursive feature
elimination embedded in SVM (RFE), TSP is increasingly more effective than the other
two methods as the informative genes become progressively more correlated, which is
demonstrated both in terms of the classification performance and the ability to recover
true informative genes. We also applied this hybrid scheme to three cancer prognosis
datasets, in which k-TSP+SVM outperforms the k-TSP classifier in all datasets and achieves
performance comparable or superior to that of SVM alone. In
concurrence with what is observed in simulation, TSP appears to be a better feature selector than Fisher and
RFE in two of the three cancer datasets.

Conclusions: The k-TSP ranking algorithm can be used as a computationally efficient,
multivariate filter method for feature selection in machine learning. SVM in combination
with the k-TSP ranking algorithm outperforms k-TSP and SVM alone in simulated datasets
and in some cancer prognosis datasets. Simulation studies suggest that, as a feature
selector, k-TSP is better tuned to certain data characteristics, i.e., correlation among
informative genes, which makes it potentially interesting as an alternative feature-ranking
method in pathway analysis.
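The feature ranking that drives k-TSP can be sketched in a few lines: each gene pair is scored by how strongly the within-class frequency of the ordering of its two expression values differs between the two classes. The toy data below are hypothetical; a real implementation would also handle ties and secondary ranking criteria.

```python
from itertools import combinations

def tsp_scores(X, y):
    """Score each gene pair (i, j) by the difference, across the two classes,
    of the within-class frequency of the ordering X_i < X_j (the TSP score)."""
    classes = sorted(set(y))
    scores = {}
    n_genes = len(X[0])
    for i, j in combinations(range(n_genes), 2):
        freq = []
        for c in classes:
            rows = [x for x, lab in zip(X, y) if lab == c]
            freq.append(sum(1 for x in rows if x[i] < x[j]) / len(rows))
        scores[(i, j)] = abs(freq[0] - freq[1])
    return scores

# Toy data: genes 0 and 1 reverse their ordering between classes,
# while gene 2 keeps the same ordering with everything in both classes.
X = [[1, 5, 10], [2, 6, 10], [1, 4, 10],   # class 0: gene0 < gene1
     [7, 2, 10], [8, 1, 10], [9, 3, 10]]   # class 1: gene0 > gene1
y = [0, 0, 0, 1, 1, 1]

scores = tsp_scores(X, y)
best_pair = max(scores, key=scores.get)
# best_pair == (0, 1), with score 1.0: the ordering flips completely.
```

Because the score depends only on within-sample orderings, it is invariant to monotone normalization of each array, which is the property the abstract's "relative expression ordering" refers to.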

Often researchers must choose between two or more structural equation models for a given set
of data. Typically, selection is based on having the highest chi-square p-value or the best value of
a fit index such as the CFI or RMSEA. Though this is a common situation, there is little evidence on the
performance of these fit indices in choosing between models. In other statistical applications,
Bayes Factor approximations such as the BIC are commonly used to select between models,
but these are rarely used in SEMs. This paper examines several new and old Bayes Factor
approximations along with some commonly used fit indices to assess their accuracy in choosing
the true model among a broad set of false models. The results show that the Bayes Factor
approximations outperform the other fit indices. Among these approximations, one of the new
ones, the SPBIC, is particularly promising. The commonly used chi-square p-value and the CFI,
IFI, and RMSEA do much worse.

Protein microarrays are a high-throughput technology capable of generating large quantities of proteomics data. They can be used for general research or for clinical diagnostics. Bioinformatics and statistical analysis techniques are required to interpret raw data and reach biologically relevant conclusions. We describe essential algorithms for processing protein microarray data, including spot-finding on slide images, Z score and significance analysis of microarrays (SAM) calculations, and the concentration dependent analysis (CDA). We also describe available tools for protein microarray analysis, and provide a template for a step-by-step approach to performing an analysis centered on the CDA method. We conclude with a discussion of fundamental and practical issues and considerations.
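Of the algorithms mentioned, the Z score is the simplest to sketch: each spot intensity is standardized against the mean and standard deviation across the array, so extreme spots stand out on a common scale. The intensities and cutoff below are hypothetical.

```python
import statistics

def z_scores(intensities):
    """Standardize spot intensities: z = (x - mean) / sd."""
    mu = statistics.mean(intensities)
    sd = statistics.stdev(intensities)   # sample standard deviation
    return [(x - mu) / sd for x in intensities]

spots = [100.0, 102.0, 98.0, 101.0, 99.0, 250.0]  # one outlying spot
z = z_scores(spots)
flagged = [i for i, zi in enumerate(z) if abs(zi) > 2.0]
# flagged == [5]: only the outlying spot exceeds the |z| > 2 cutoff
```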

Background: Tumor-specific antigens and their specific epitopes are formulation targets for patient-specific cancer vaccines. A selection of prediction servers is available for identification of peptides that bind major histocompatibility complex class I (MHC-I) molecules. However, the lack of standardized methodology and the large number of human MHC-I molecules make the selection of appropriate prediction servers difficult. This study reports a comparative evaluation of thirty prediction servers for seven human MHC-I molecules.
Results: Of 147 individual predictors, 39 showed excellent, 47 good, 33 marginal, and 28 poor ability to classify binders from non-binders. The classifiers for HLA-A*0201, A*0301, A*1101, B*0702, B*0801, and B*1501 have excellent, and for A*2402 moderate, classification accuracy. In addition, 16 prediction servers predict peptide binding affinity to MHC-I molecules with high accuracy, with correlation coefficients ranging from r=0.55 (B*0801) to r=0.87 (A*0201).
Conclusions: Non-linear predictors outperform matrix-based predictors, and the majority of predictors can be improved by non-linear transformations of their raw prediction scores. The best predictors of peptide binding (both classification and binding affinity) show the best performance in prediction of T-cell epitopes. We propose a new standard for prediction of MHC-I binding: a common scale for normalization of prediction scores that is applicable to both experimental and predicted scores.
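The binding-affinity accuracies above are reported as correlation coefficients between predicted and experimentally measured affinities; a minimal sketch of that evaluation, with hypothetical values, is:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between predicted and measured binding affinities."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical predicted vs. measured (e.g. normalized log-affinity) values.
predicted = [0.1, 0.4, 0.35, 0.8, 0.7]
measured  = [0.2, 0.5, 0.30, 0.9, 0.6]
r = pearson_r(predicted, measured)
# r is high (> 0.9) here, on the order of the best servers in the study
```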

In this article we propose a general class of risk measures that can be
used for data-based evaluation of parametric models.
The loss function is defined as a generalized quadratic distance between the true
density and the proposed model. These distances are
characterized by a simple quadratic form structure that is adaptable through
the choice of a nonnegative definite kernel and a bandwidth parameter. Using asymptotic results
for the quadratic distances, we build a quick-to-compute
approximation for the risk function. Its derivation is analogous to
that of the Akaike Information Criterion (AIC), but unlike AIC, the quadratic risk is a global comparison
tool. The method does not require resampling, a great advantage when point estimators are expensive to
compute.
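To make the quadratic distance concrete, the sketch below estimates, for a one-dimensional sample, the kernel quadratic distance between the empirical distribution and a normal model, using a normalized Gaussian kernel. The closed-form cross terms rely on the fact that a Gaussian kernel convolved with a Gaussian model is again Gaussian. The data and bandwidth are hypothetical, and this is a generic V-statistic estimate rather than the paper's exact risk approximation.

```python
import math, random

def gauss(x, mu, var):
    """Normal density with mean mu and variance var, evaluated at x."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def quadratic_distance(sample, mu, sigma2, h2):
    """V-statistic estimate of the kernel quadratic distance between the
    empirical distribution and a N(mu, sigma2) model, with a normalized
    Gaussian kernel of bandwidth variance h2."""
    n = len(sample)
    # E_Fhat E_Fhat K: double sum of the kernel over the sample.
    t1 = sum(gauss(x, y, h2) for x in sample for y in sample) / n ** 2
    # E_Fhat E_G K: kernel-model convolution has variance h2 + sigma2.
    t2 = sum(gauss(x, mu, h2 + sigma2) for x in sample) / n
    # E_G E_G K: difference of two model draws has variance 2*sigma2.
    t3 = 1.0 / math.sqrt(2 * math.pi * (h2 + 2 * sigma2))
    return t1 - 2 * t2 + t3

random.seed(1)
data = [random.gauss(0, 1) for _ in range(200)]
d_true  = quadratic_distance(data, mu=0.0, sigma2=1.0, h2=0.25)
d_wrong = quadratic_distance(data, mu=3.0, sigma2=1.0, h2=0.25)
# The distance to the true model N(0,1) is smaller than to N(3,1).
```

Because the Gaussian kernel is nonnegative definite, the estimate is nonnegative, which is what makes it usable as a loss for model comparison.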

This work builds
a unified framework for the study of quadratic form distance measures as
they are used in assessing the goodness of fit of models. Many important
procedures have this structure, but the theory for these methods is
dispersed and incomplete. Central to the statistical analysis of these
distances is the spectral decomposition of the kernel that generates the
distance. We show how this determines the limiting distribution of natural
goodness of fit tests. Additionally, we develop a new notion, the spectral
degrees of freedom of the test, based on this decomposition. The degrees of
freedom are easy to compute and estimate, and can be used as a guide in the
construction of useful procedures in this class.
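One standard moment-matching recipe for summarizing a spectrum as degrees of freedom (whether it coincides exactly with the paper's spectral degrees of freedom is an assumption here) matches the first two moments of the weighted chi-squared limit, a Satterthwaite-style calculation:

```python
def satterthwaite_dof(eigvals):
    """Moment-matching degrees of freedom for a weighted chi-squared limit
    sum_j lambda_j * chi2_1:  dof = (sum lambda)^2 / (sum lambda^2)."""
    s1 = sum(eigvals)
    s2 = sum(l * l for l in eigvals)
    return s1 * s1 / s2

# k equal eigenvalues recover the classical chi-squared with k dof.
flat = satterthwaite_dof([0.5] * 10)                 # exactly 10
# A rapidly decaying kernel spectrum gives an effective dof far below
# the nominal number of eigenvalues.
decay = satterthwaite_dof([0.5 ** j for j in range(1, 21)])  # about 3
```

This illustrates the point in the abstract that the degrees of freedom are easy to compute once the spectral decomposition of the kernel is available.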

Background: A key step in the development of an adaptive immune response to pathogens or vaccines is the
binding of short peptides to molecules of the Major Histocompatibility Complex (MHC) for presentation to T
lymphocytes, which are thereby activated and differentiate into effector and memory cells. The rational design
of vaccines consists in part in the identification of appropriate peptides to effect this process. There are several
algorithms currently in use for making such predictions, but these are limited to a small number of MHC molecules
and have good but imperfect prediction power.
Results: We have undertaken an exploration of the power gained by taking advantage of a natural representation
of the amino acids in terms of their biophysical properties. We used several well-known statistical classifiers with
either a naive encoding of amino acids by name or an encoding by biophysical properties. In all cases, the encoding
by biophysical properties leads to substantially lower misclassification error.
Conclusions: Representation of amino acids using a few important bio-physio-chemical properties provides a natural
basis for representing peptides and greatly improves peptide-MHC class I binding prediction.

In this paper, we develop a mode-based clustering approach applying new
optimization techniques to a nonparametric density estimator. A cluster is formed by those sample points that ascend to the same local maximum (mode) of the density function. The path from a point to its associated mode is efficiently computed by an EM-style algorithm, namely the Modal EM (MEM). This clustering method shares the major advantages of mixture-model-based clustering. Moreover, it requires no model fitting and ensures that every cluster corresponds to a bump of the density. A hierarchical clustering algorithm is also developed by applying MEM recursively to kernel density estimators with increasing bandwidths. The issue of diagnosing clustering results is investigated. Specifically, a pairwise cluster separability measure is defined using the ridgeline between the density bumps of two clusters. The ridgeline is solved for by the Ridgeline EM (REM) algorithm, an extension of MEM. Based upon this new measure, a cluster merging procedure is developed to guarantee strong separation between clusters. Experiments demonstrate that our clustering approach tends to combine the strengths of mixture-model-based and linkage-based clustering. Tests on both simulated and real data show that the approach is robust in high dimensions and when clusters deviate substantially from Gaussian distributions. Both of these cases pose difficulty for parametric mixture modeling.
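For a Gaussian kernel density estimate, the MEM ascent step reduces to a weighted mean of the data (the mean-shift form), which makes the mode-association idea easy to sketch. The one-dimensional data and bandwidth below are hypothetical, and the mode-matching tolerance is ad hoc.

```python
import math

def mem_ascent(start, data, h2, steps=200):
    """Ascend to a mode of a Gaussian kernel density estimate.  For a
    Gaussian kernel the MEM update is a weighted mean of the data."""
    x = start
    for _ in range(steps):
        w = [math.exp(-(x - xi) ** 2 / (2 * h2)) for xi in data]
        total = sum(w)
        x = sum(wi * xi for wi, xi in zip(w, data)) / total
    return x

def modal_clusters(data, h2):
    """Label each point by the mode its MEM ascent converges to."""
    modes, labels = [], []
    for point in data:
        m = mem_ascent(point, data, h2)
        for k, existing in enumerate(modes):
            if abs(m - existing) < 1e-3:   # ad hoc matching tolerance
                labels.append(k)
                break
        else:
            modes.append(m)
            labels.append(len(modes) - 1)
    return labels, modes

# Two well-separated 1-d groups: every point ascends to its group's mode.
data = [0.0, 0.2, 0.4, -0.3, 10.0, 10.3, 9.8, 10.1]
labels, modes = modal_clusters(data, h2=0.5)
# labels == [0, 0, 0, 0, 1, 1, 1, 1] and two modes are found
```

Raising the bandwidth h2 merges nearby modes, which is exactly the mechanism behind the hierarchical (multi-bandwidth) version described above.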

Describing the probability densities of multi-object complexes in terms of
individual objects and their inter-object relationships
leads to desirable locality without ignoring the context of an
object. We describe a means of decomposing object variations into
self effects and neighbor effects, and an approach for
estimating the self and neighbor effect probability densities for
each object in the complex using augmentation and prediction,
supported by principal geodesic analysis (PGA) on m-reps. We apply this method to the inter-day
variation of m-reps of male pelvic organs within an individual
patient.

The advancing technology for automatic segmentation of medical images should
be accompanied by techniques to inform the user of the credibility of results.
To the extent that this technology produces clinically acceptable segmentations
for a significant fraction of cases, there is a risk that the clinician will assume
every result is acceptable. In the less frequent case where segmentation fails,
we are concerned that unless the user is alerted by the computer, she would
still put the result to clinical use. We propose an automated method to signal
suspected noncredible regions of the segmentation, triggered by outlier values
of the local image match function. The user can then focus her validation resources
on the noncredible regions.
When the local image match function is computed via a Mahalanobis distance,
as is the case for PCA-based matches, its value follows the chi-squared
distribution. Our method signals a noncredible region wherever the probability
of a chi-squared random variable being greater than the observed match is above
a threshold level.
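A sketch of the credibility flagging: given a Mahalanobis match statistic with df degrees of freedom, compute the chi-squared tail probability of the observed value and flag regions where it is extreme. This sketch treats large match values (small upper-tail probabilities) as suspect; the thresholding convention of the actual method may differ. The closed form used here holds for even df, and the match values are hypothetical.

```python
import math

def chi2_sf(x, df):
    """Survival function P(chi2_df > x) for even df, via the closed form
    exp(-x/2) * sum_{j < df/2} (x/2)^j / j!."""
    assert df % 2 == 0, "closed form shown only for even df"
    s, term = 0.0, 1.0
    for j in range(df // 2):
        if j > 0:
            term *= (x / 2) / j
        s += term
    return math.exp(-x / 2) * s

def noncredible_regions(match_values, df, alpha=0.05):
    """Flag regions whose Mahalanobis match statistic is so large that its
    chi-squared upper-tail probability falls below alpha."""
    return [i for i, m in enumerate(match_values) if chi2_sf(m, df) < alpha]

# df = 4: a match value of 20 is far in the tail; the others are unremarkable.
flags = noncredible_regions([3.0, 4.1, 20.0, 2.2], df=4, alpha=0.05)
# flags == [2]
```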
ROC analysis validates our noncredibility test on m-rep segmentations of
the bladder in CT images, using an image match computed by PCA on regional
intensity quantile functions. We approximate ground truth by taking the truly
noncredible regions to be those with surface distance greater than 5 mm from a
reference segmentation. We swept out ROC curves by varying the threshold level.
The area under the ROC curve was 0.91. Based on this preliminary result, our
method shows potential for validation in an automatic segmentation pipeline.

We study the effect of degrees of freedom on the level and power of quadratic-distance-based tests. The concept of an \textit{eigendepth index} is introduced and discussed in the context of selecting the optimal degrees of freedom, where optimality refers to high power. We introduce the class of diffusion kernels
through the properties we require of them, and give a method for constructing them by exponentiating the rate matrix of a Markov chain. Product kernels and their spectral decompositions are discussed and shown to be useful for high-dimensional data problems.
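Exponentiating a rate matrix can be sketched with a truncated Taylor series: for a rate matrix Q (rows summing to zero, nonnegative off-diagonals), exp(tQ) is a stochastic matrix, and when Q is symmetric the result is also a symmetric positive definite kernel. The 2-state chain below is hypothetical, and the series truncation is adequate only when t times the norm of Q is modest.

```python
import math

def mat_exp(Q, t, terms=60):
    """exp(tQ) for a small matrix via the truncated Taylor series
    sum_k (tQ)^k / k!  (adequate when t*||Q|| is modest)."""
    n = len(Q)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    power = [row[:] for row in result]         # holds (tQ)^k / k!
    for k in range(1, terms):
        power = [[sum(power[i][m] * t * Q[m][j] for m in range(n)) / k
                  for j in range(n)] for i in range(n)]
        result = [[result[i][j] + power[i][j] for j in range(n)]
                  for i in range(n)]
    return result

# Rate matrix of a symmetric 2-state Markov chain (rows sum to zero).
Q = [[-1.0, 1.0], [1.0, -1.0]]
K = mat_exp(Q, t=1.0)
# K is stochastic (nonnegative entries, rows summing to 1) and here
# K[0][0] equals (1 + e^{-2t}) / 2 in closed form.
```

Product kernels for higher-dimensional data follow by multiplying such one-coordinate kernels, whose eigenvalues then multiply across coordinates.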

This research was initiated by the analysis of the NCI60 cancer dataset. The dataset contains gene expression values (from cDNA arrays) corresponding to 3509 genes collected from 60 different patients diagnosed with 8 different cancer types (assumed unknown in the following discussion). The goal is to provide a model-based approach for simultaneously clustering cancer types (columns) and the genes (rows) involved in differentiating these cancer types. We formulate a novel two-way mixture framework and adapt our distance-based model selection tool to determine the unknown number of row and column clusters. This methodology avoids two major pitfalls of using model-based clustering in high dimensions. First, the two-way mixture has a considerably smaller parameter set, compared to the full multivariate analysis, making all parameters estimable. Second, unlike the complex distribution of likelihood-ratio-based tools under the composite null hypothesis of fixed row and column clusters, the distribution of our distance-based model selection tool is well defined, even for composite hypotheses. Finally, based on the geometry of pure Gaussian HDLSS (high-dimension, low-sample-size) data, we provide an effective visual diagnostic tool to uncover any remaining structure in the data. Through our analysis, we uncovered some interesting sets of gene clusters, but some of our cancer-type clusters did not match the initial cancer labels. On later verification we found that this discordance was due to the close similarity in symptoms and pathological test results of the two cancer types in question.

Multivariate mixtures provide flexible methods for both fitting and
partitioning high-dimensional data. Ray and Lindsay (2005) show that the
topography of multivariate mixtures, in the sense of their key features as a
density, can be analyzed rigorously in lower dimensions by use of a ridgeline
manifold that contains all critical points as well as the ridges of the
density. In addition, we have developed a new computing procedure, similar to
the EM algorithm, that can quickly find the modes of a mixture density.
This tool can be extended to examine the degree of separation
between the modes based on the ridgeline separating them.
These tools can be used in various ways. For one, we can take a conventional
mixture analysis and cluster together those components
whose contribution is actually unimodal. This cluster could then
represent a single true component with a more complex distribution.
We can also turn kernel density estimation into a hierarchical clustering tool
in which the data points become identified with each other by their association
with a common mode of the density estimator. Separate clusters must then
correspond to gaps in the estimated density. The analysis
is multi-scale, as different levels of smoothing provide different
aggregations.