Search results

The ultimate goal of genome-wide association studies (GWAS) is to understand the underlying relationship between genetic variants and phenotype. While much of the heritability remains missing in the univariate analyses of traditional GWAS, it is believed that the joint analysis of variants that function interactively in a biological pathway (gene set) is more effective at detecting association signals. With the fast pace of sequencing technology, human genome variation will be observed in ever finer detail, and hence the dimension of the variants in a pathway can be extremely high. To model the systematic mechanism and the potential nonlinear interactions among the variants, this dissertation models the set effect through a flexible nonparametric function under a high-dimensional setup that allows the dimension to grow to infinity with the sample size.

Chapter 2 considers testing a nonparametric function of high-dimensional variates in a reproducing kernel Hilbert space (RKHS), a function space generated by a positive definite or semidefinite kernel function. We propose a statistic for testing the nonparametric function under the high-dimensional setting. The asymptotic distributions of the test statistic are derived under the null hypothesis and under a series of local alternative hypotheses, for which explicit power formulas are also provided. We also develop a novel kernel selection procedure to maximize the power of the proposed test, as well as a kernel regularization procedure to further improve power. Extensive simulation studies and a real data analysis were conducted to evaluate the performance of the proposed method.

Chapter 3 is a theoretical investigation of the statistical optimality of the kernel-based test statistic under the high-dimensional setup, from the minimax point of view. In particular, we consider a high-dimensional linear model as the initial study. Unlike the sparsity or independence assumptions found in the related literature, we discuss the minimax properties in a structure-free setting. We characterize the boundary that separates the testable region from the non-testable region, and show the rate optimality of the kernel-based test statistic under certain conditions on the covariance matrix and the growth rate of the dimension.

Our work in Chapter 4 fills a gap in kernel-based testing with multiple candidate kernels under the high-dimensional setting. First, we extend the test statistic proposed in Chapter 2 to a more inclusive form that allows adjustment for covariates. The asymptotic distribution of the new test statistic under the null hypothesis is then provided. Two practical and efficient strategies are developed to incorporate multiple candidate kernels into the testing procedure. Through comprehensive simulation studies we show that both strategies calibrate the type I error rate and improve the power over a poor choice of kernel from the candidate set.
In particular, the maximum method, one of the two strategies, is shown to have the potential to boost the power close to that attained by the best candidate kernel. An application to Thai baby birth weight data further demonstrates the merits of the proposed methods.
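To make the flavor of such a procedure concrete, here is a minimal sketch, not the dissertation's exact statistic: a score-type kernel statistic Q = (y − ȳ)ᵀK(y − ȳ) is computed for each candidate kernel, and the maximum of the standardized statistics is calibrated by permutation. The linear and Gaussian kernels, the permutation calibration, and all parameter values below are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(X, bandwidth):
    # K_ij = exp(-||x_i - x_j||^2 / (2 * bandwidth^2))
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / (2 * bandwidth**2))

def score_statistic(K, y):
    # Score-type kernel statistic Q = (y - ybar)' K (y - ybar).
    r = y - y.mean()
    return float(r @ K @ r)

def max_kernel_test(X, y, kernels, n_perm=999, seed=0):
    """Maximum statistic over candidate kernels, calibrated by permutation.

    Each per-kernel statistic is standardized by its permutation moments so
    that kernels on different scales are comparable before taking the max.
    """
    rng = np.random.default_rng(seed)
    Ks = [k(X) for k in kernels]
    perm_stats = np.empty((n_perm, len(Ks)))
    for b in range(n_perm):
        yp = rng.permutation(y)
        perm_stats[b] = [score_statistic(K, yp) for K in Ks]
    mu, sd = perm_stats.mean(axis=0), perm_stats.std(axis=0)
    obs = np.array([score_statistic(K, y) for K in Ks])
    t_obs = np.max((obs - mu) / sd)
    t_perm = np.max((perm_stats - mu) / sd, axis=1)
    # Permutation p-value for the maximum standardized statistic.
    return (1 + np.sum(t_perm >= t_obs)) / (1 + n_perm)

# Toy usage: n = 100 subjects, p = 500 variants, a weak additive signal.
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 500))
y = 0.1 * X[:, :5].sum(axis=1) + rng.standard_normal(100)
kernels = [lambda X: X @ X.T / X.shape[1],                     # linear kernel
           lambda X: gaussian_kernel(X, np.sqrt(X.shape[1]))]  # Gaussian kernel
print(max_kernel_test(X, y, kernels))
```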

This thesis consists of two chapters. The first chapter proposes a class of minimum distance tests for fitting a parametric regression model to a regression function when some responses are missing at random. These tests are based on a class of minimum integrated square distances between a kernel-type estimator of the regression function and the parametric regression function being fitted. The estimators of the regression function are based on two completed data sets constructed by imputation and inverse probability weighting. The corresponding test statistics are shown to be asymptotically normal under the null hypothesis. Some simulation results are also presented.

The second chapter considers the problem of testing the equality of two nonparametric regression curves against one-sided alternatives, based on two samples with possibly distinct design and error densities, when responses are missing at random. This chapter proposes a class of tests using imputation and covariate matching. The asymptotic distributions of these test statistics are shown to be Gaussian under the null hypothesis and under a class of local nonparametric alternatives. The consistency of these tests against a large class of fixed alternatives is also established. The chapter also includes a simulation study, which assesses the finite-sample behavior of a member of this class of tests.
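As a rough sketch of the first chapter's ingredients, not its exact test, the code below completes the responses by inverse probability weighting, forms a Nadaraya-Watson estimate of the regression function, and minimizes the integrated squared distance to a linear parametric family m_theta(x) = theta * x. The Gaussian smoothing kernel, the linear null model, and the grid minimization are all illustrative assumptions.

```python
import numpy as np

def nadaraya_watson(x_eval, X, Y, h):
    # Kernel regression estimate with a Gaussian smoothing kernel of bandwidth h.
    W = np.exp(-0.5 * ((x_eval[:, None] - X[None, :]) / h) ** 2)
    return (W @ Y) / W.sum(axis=1)

def ipw_complete(Y, delta, X, h):
    # Inverse-probability-weighted completion delta_i * Y_i / pi_hat(X_i),
    # where the missingness probability pi(x) = P(delta = 1 | X = x) is itself
    # estimated by a kernel regression of the indicators delta on X.
    pi_hat = nadaraya_watson(X, X, delta.astype(float), h)
    return delta * Y / np.clip(pi_hat, 0.05, 1.0)

def min_distance_statistic(X, Y, delta, h, theta_grid):
    # Minimum integrated squared distance between the kernel estimate of the
    # regression function and the linear family m_theta(x) = theta * x.
    Yc = ipw_complete(Y, delta, X, h)
    grid = np.linspace(X.min(), X.max(), 200)
    dx = grid[1] - grid[0]
    m_hat = nadaraya_watson(grid, X, Yc, h)
    return min(np.sum((m_hat - t * grid) ** 2) * dx for t in theta_grid)

# Toy usage: data generated from the null model, responses missing at random.
rng = np.random.default_rng(0)
n = 300
X = rng.uniform(-1, 1, n)
Y = 2.0 * X + 0.3 * rng.standard_normal(n)
delta = rng.uniform(size=n) < 1 / (1 + np.exp(-2 * X))   # MAR missingness
T = min_distance_statistic(X, Y, delta, h=0.15,
                           theta_grid=np.linspace(0, 4, 81))
print(T)   # small under the null; inflates under model misspecification
```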

There has been a rapid increase in the volume of digital data over recent years. A study by IDC and EMC Corporation predicted the creation of 44 zettabytes (10^21 bytes) of digital data by the year 2020. Analysis of such massive amounts of data, popularly known as big data, necessitates highly scalable data analysis techniques. Clustering is an exploratory data analysis tool used to discover the underlying groups in data. The state-of-the-art algorithms for clustering big data sets are linear clustering algorithms, which assume that the data are linearly separable in the input space and use measures such as the Euclidean distance to define inter-point similarities. Though efficient, linear clustering algorithms do not achieve high cluster quality on real-world data sets, which are typically not linearly separable. Kernel-based clustering algorithms employ nonlinear similarity measures to define inter-point similarities, and as a result can identify clusters of arbitrary shapes and densities. However, kernel-based clustering techniques suffer from two major limitations: (i) their running time and memory complexity grow quadratically with the size of the data set, so they cannot scale up to data sets containing billions of data points; and (ii) their performance is highly sensitive to the choice of the kernel similarity function. Ad hoc approaches relying on prior domain knowledge are currently employed to choose the kernel function, and it is difficult to determine the appropriate kernel similarity function for a given data set.

In this thesis, we develop scalable approximate kernel-based clustering algorithms using random sampling and matrix approximation techniques. They can cluster big data sets containing billions of high-dimensional points not only as efficiently as linear clustering algorithms but also as accurately as classical kernel-based clustering algorithms.

Our first contribution is based on the premise that the similarity matrices corresponding to big data sets can usually be well approximated by low-rank matrices built from a subset of the data. We develop an approximate kernel-based clustering algorithm that uses a low-rank approximate kernel matrix, constructed from a uniformly sampled small subset of the data, to perform clustering. We show that the proposed algorithm has linear running-time complexity and low memory requirements, and that it achieves high cluster quality when provided with a sufficient number of data samples. We also demonstrate that the proposed algorithm can be easily parallelized to handle distributed data sets.
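A minimal sketch of this low-rank idea, assuming a Nyström-style construction with an RBF kernel and scikit-learn's KMeans (illustrative choices, not the thesis's exact algorithm):

```python
import numpy as np
from sklearn.cluster import KMeans

def rbf(A, B, gamma):
    # K_ij = exp(-gamma * ||a_i - b_j||^2)
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def nystrom_kernel_kmeans(X, k, m=500, gamma=0.5, seed=0):
    """Approximate kernel k-means via a uniformly sampled landmark subset.

    The m landmarks induce the low-rank approximation K ~= C W^{-1} C',
    where C = k(X, landmarks) and W = k(landmarks, landmarks). Clustering
    the rows of the factor C W^{-1/2} with plain k-means approximates
    kernel k-means at O(n m) cost instead of O(n^2)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(m, len(X)), replace=False)
    L = X[idx]
    C = rbf(X, L, gamma)                    # n x m cross-kernel block
    W = rbf(L, L, gamma)                    # m x m landmark kernel
    vals, vecs = np.linalg.eigh(W + 1e-8 * np.eye(len(L)))
    W_inv_sqrt = vecs @ np.diag(vals**-0.5) @ vecs.T
    Z = C @ W_inv_sqrt                      # low-dimensional embedding
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(Z)

# Toy usage: two concentric rings, which defeat k-means in the input space.
rng = np.random.default_rng(1)
theta = rng.uniform(0, 2 * np.pi, 1000)
r = np.where(rng.uniform(size=1000) < 0.5, 1.0, 3.0)
X = np.c_[r * np.cos(theta), r * np.sin(theta)]
X += 0.1 * rng.standard_normal((1000, 2))
labels = nystrom_kernel_kmeans(X, k=2, m=100, gamma=2.0)
```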
We then employ nonlinear random feature maps to approximate the kernel similarity function, and design clustering algorithms that enhance the efficiency of kernel-based clustering as well as label assignment for previously unseen data points, as sketched at the end of this abstract.

Our next contribution is an online kernel-based clustering algorithm that can cluster potentially unbounded data streams in real time. It intelligently samples the data stream and finds the cluster labels using these sampled points. The proposed scheme is more effective than current kernel-based and linear stream clustering techniques, in terms of both efficiency and cluster quality. We finally address the issues of high dimensionality and of scalability to data sets containing a large number of clusters. Under the assumption that the kernel matrix is sparse when the number of clusters is large, we modify the above online kernel-based clustering scheme to perform clustering in a low-dimensional space spanned by the top eigenvectors of the sparse kernel matrix. The combination of sampling and sparsity further reduces the running time and memory complexity.

The proposed clustering algorithms can be applied in a number of real-world applications. We demonstrate the efficacy of our algorithms on several large benchmark text and image data sets. For instance, the proposed batch kernel clustering algorithms were used to cluster large image data sets (e.g., Tiny) containing up to 80 million images, and the proposed stream kernel clustering algorithm was used to cluster over a billion tweets from Twitter for hashtag recommendation.
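Returning to the random feature maps mentioned above, here is a minimal sketch using Rahimi and Recht's random Fourier features for the RBF kernel, with scikit-learn's MiniBatchKMeans standing in as the fast linear clusterer; the feature dimension and all parameters are assumptions for illustration, not the thesis's algorithm.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def random_fourier_features(X, D=256, gamma=0.5, seed=0):
    """Map X so that z(x) . z(y) ~= exp(-gamma * ||x - y||^2).

    Draw frequencies w ~ N(0, 2*gamma*I) and phases b ~ Uniform(0, 2*pi);
    the feature map is z(x) = sqrt(2/D) * cos(W'x + b)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(X.shape[1], D))
    b = rng.uniform(0, 2 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

# Linear clustering in the random feature space approximates kernel
# clustering at a cost linear in the number of points, and the fitted
# model can label previously unseen points with one more feature map.
rng = np.random.default_rng(1)
X = np.r_[rng.normal(0, 0.3, (500, 10)), rng.normal(1, 0.3, (500, 10))]
Z = random_fourier_features(X, D=256, gamma=1.0)
km = MiniBatchKMeans(n_clusters=2, n_init=10, random_state=0).fit(Z)
# Same seed reproduces the same feature map for out-of-sample points.
X_new = rng.normal(0, 0.3, (5, 10))
new_labels = km.predict(random_fourier_features(X_new, D=256, gamma=1.0))
```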

This thesis examines the design of noise-robust information retrieval techniques based on kernel methods. Algorithms are presented for two biosensing applications: (1) high-throughput protein arrays and (2) non-invasive respiratory signal estimation.

Our primary objective in protein array design is to maximize throughput by enabling detection of an extremely large number of protein targets while using a minimal number of receptor spots. This is accomplished by viewing the protein array as a communication channel and evaluating its information transmission capacity as a function of its receptor probes. In this framework, the channel capacity can be used as a tool to optimize probe design, the optimal probes being the ones that maximize capacity. The information capacity is first evaluated for a small-scale protein array with only a few protein targets. We believe this is the first effort to evaluate the capacity of a protein array channel. For this purpose, models of the proteomic channel's noise characteristics and receptor non-idealities, based on experimental prototypes, are constructed. Kernel methods are employed to extend the capacity evaluation to larger protein arrays that can potentially have thousands of distinct protein targets. A specially designed kernel, which we call the Proteomic Kernel, is also proposed. This kernel incorporates knowledge about the biophysics of target and receptor interactions into the cost function employed for the evaluation of channel capacity.

For respiratory estimation, this thesis investigates estimation of breathing rate and lung volume using multiple non-invasive sensors under motion artifact and high-noise conditions. A spirometer signal is used as the gold standard for the evaluation of errors. A novel algorithm called segregated envelope and carrier (SEC) estimation is proposed. This algorithm approximates the spirometer signal by an amplitude-modulated signal and segregates the estimation of the frequency and amplitude information. Results demonstrate that this approach enables effective estimation of both breathing rate and lung volume. An adaptive algorithm based on a combination of Gini kernel machines and wavelet filtering is also proposed. This algorithm, titled the wavelet-adaptive Gini (WAGini) algorithm, employs a novel wavelet-transform-based feature extraction front end to classify the subject's underlying respiratory state. This information is then employed to select the parameters of the adaptive kernel machine based on the subject's respiratory state. Results demonstrate significant improvement in breathing rate estimation when compared to traditional respiratory estimation techniques.
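As a loose illustration of the envelope/carrier separation idea, and only that, since the thesis does not spell out its estimator here, the sketch below uses the Hilbert transform (an assumption) to split a synthetic amplitude-modulated respiratory signal into an envelope carrying amplitude (volume-like) information and an instantaneous frequency carrying the breathing rate:

```python
import numpy as np
from scipy.signal import hilbert

fs = 50.0                      # sampling rate (Hz), an assumed value
t = np.arange(0, 60, 1 / fs)   # one minute of synthetic signal
rate_hz = 0.25                 # 15 breaths per minute
envelope_true = 1.0 + 0.3 * np.sin(2 * np.pi * 0.02 * t)       # slow volume drift
x = envelope_true * np.sin(2 * np.pi * rate_hz * t)
x += 0.05 * np.random.default_rng(0).standard_normal(t.size)   # sensor noise

# Analytic signal: the magnitude gives the envelope (amplitude information),
# the unwrapped phase derivative gives the instantaneous frequency (rate).
analytic = hilbert(x)
envelope = np.abs(analytic)
inst_freq = np.diff(np.unwrap(np.angle(analytic))) * fs / (2 * np.pi)

breaths_per_min = 60 * np.median(inst_freq)   # median is robust to noise spikes
print(round(breaths_per_min, 1))              # ~15 for this synthetic signal
```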