Testing Independence in High Dimensions & Identifiability of Graphical Models

Abstract

This thesis studies two problems in multivariate statistics. In the first chapter, we treat the problem of testing independence between m continuous observations when m can be larger than the available sample size n. We consider three types of test statistics that are constructed as sums of many pairwise rank correlation signals. In the asymptotic regime where both m and n tend to infinity, a martingale central limit theorem is applied to show that the null distributions of these statistics converge to Gaussian limits, which are valid without specific distributional or moment assumptions on the data. Using the framework of U-statistics, our result covers a variety of rank correlations, including Kendall's tau and a dominating term of Spearman's rank correlation coefficient (rho), as well as degenerate U-statistics such as Hoeffding's D and the tau* of Bergsma and Dassios. As in the classical theory for U-statistics, the test statistics need to be scaled differently when the rank correlations used to construct them are degenerate U-statistics. The power of the considered tests is explored through rate-optimality theory under a Gaussian equicorrelation alternative, as well as in numerical experiments for specific cases of more general alternatives.

In the second chapter, we study parameter identifiability of directed Gaussian graphical models with one latent variable. In the scenario we consider, the latent variable is a confounder that forms a source node of the graph and is a parent of all other nodes, which correspond to the observed variables. We give a graphical condition that is sufficient for the Jacobian matrix of the parametrization map to have full rank, which entails that the parametrization is generically finite-to-one, a property sometimes referred to as local identifiability. We also derive a graphical condition that is necessary for such identifiability. Finally, we give a condition under which generic parameter identifiability can be determined from identifiability of a model associated with a subgraph. The power of these criteria is assessed via an exhaustive algebraic computational study for small models with 4, 5, and 6 observable variables, and a simulation study for large models with 25 or 35 observable variables.
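
To illustrate the construction in the first chapter, the sketch below forms one statistic of the "sum of pairwise rank correlation signals" type: the sum of squared Kendall's tau over all pairs of the m variables. It is a minimal illustration and not the thesis's procedure. The function names are hypothetical, the data are assumed to have no ties, and the null distribution is calibrated by Monte Carlo simulation rather than by the Gaussian limit derived in the thesis. The simulation relies only on the fact that, under independence of continuous variables, the rank vectors are independent uniform random permutations.

```python
import itertools
import random


def kendall_tau(x, y):
    """Kendall's tau for two equal-length samples, assuming no ties (O(n^2))."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            # +1 for a concordant pair, -1 for a discordant pair
            s += 1 if (x[i] - x[j]) * (y[i] - y[j]) > 0 else -1
    return 2.0 * s / (n * (n - 1))


def sum_of_squares_statistic(columns):
    """Sum of squared Kendall's tau over all pairs of the m columns."""
    return sum(kendall_tau(a, b) ** 2
               for a, b in itertools.combinations(columns, 2))


def null_mean(n, m):
    """Null mean of the statistic: C(m, 2) * Var(tau), where
    Var(tau) = 2(2n + 5) / (9n(n - 1)) under independence."""
    return m * (m - 1) / 2 * 2.0 * (2 * n + 5) / (9.0 * n * (n - 1))


def monte_carlo_p_value(columns, n_sim=200, seed=0):
    """Monte Carlo p-value (an assumption of this sketch, standing in for
    the Gaussian approximation): simulate the rank-based statistic under
    independence using independent random permutations."""
    rng = random.Random(seed)
    n, m = len(columns[0]), len(columns)
    observed = sum_of_squares_statistic(columns)
    exceed = 0
    for _ in range(n_sim):
        sim = [rng.sample(range(n), n) for _ in range(m)]
        if sum_of_squares_statistic(sim) >= observed:
            exceed += 1
    return (exceed + 1) / (n_sim + 1)
```

Because the statistic depends on the data only through ranks, the simulated null distribution does not depend on the marginal distributions, which mirrors the distribution-free property stated in the abstract. The quadratic-time tau and the repeated simulation make this sketch practical only for small n and m.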
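
The Jacobian rank condition from the second chapter can also be checked directly for a given graph. The sketch below does not implement the thesis's graphical criteria. It assumes the parametrization Sigma = (I - Lambda)^{-T} (Omega + delta delta^T) (I - Lambda)^{-1}, with the latent variance normalized to 1, Omega diagonal, delta the vector of latent loadings, and Lambda supported on the edges of a DAG among the observed nodes. The normalization, the function names, and the 0-based edge encoding are all assumptions made for illustration. The Jacobian is evaluated in exact rational arithmetic at a random point.

```python
from fractions import Fraction
import random


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(r) for r in zip(*a)]


def add(a, b):
    return [[x + y for x, y in zip(r, s)] for r, s in zip(a, b)]


def unit(m, i, j):
    e = [[Fraction(0)] * m for _ in range(m)]
    e[i][j] = Fraction(1)
    return e


def exact_rank(rows):
    """Rank by Gaussian elimination over the rationals."""
    rows, rank = [list(r) for r in rows], 0
    for col in range(len(rows[0])):
        piv = next((r for r in range(rank, len(rows)) if rows[r][col] != 0), None)
        if piv is None:
            continue
        rows[rank], rows[piv] = rows[piv], rows[rank]
        for r in range(rank + 1, len(rows)):
            f = rows[r][col] / rows[rank][col]
            rows[r] = [x - f * y for x, y in zip(rows[r], rows[rank])]
        rank += 1
    return rank


def jacobian_rank(m, edges, seed=0):
    """Return (rank of Jacobian at a random rational point, number of parameters).
    `edges` lists pairs (i, j) for i -> j and must form a DAG on nodes 0..m-1."""
    rng = random.Random(seed)
    draw = lambda: Fraction(rng.randint(1, 97), rng.randint(1, 97))
    omega = [draw() for _ in range(m)]
    delta = [draw() for _ in range(m)]
    lam = [[Fraction(0)] * m for _ in range(m)]
    for i, j in edges:
        lam[i][j] = draw()
    # B = (I - Lambda)^{-1} = I + Lambda + ... + Lambda^(m-1); Lambda is nilpotent for a DAG
    eye = [[Fraction(int(i == j)) for j in range(m)] for i in range(m)]
    b, power = eye, eye
    for _ in range(m - 1):
        power = matmul(power, lam)
        b = add(b, power)
    bt = transpose(b)
    w = [[(omega[i] if i == j else Fraction(0)) + delta[i] * delta[j]
          for j in range(m)] for i in range(m)]
    sigma = matmul(matmul(bt, w), b)
    derivs = []
    for j in range(m):  # d Sigma / d omega_j = B^T e_j e_j^T B
        derivs.append(matmul(matmul(bt, unit(m, j, j)), b))
    for j in range(m):  # d Sigma / d delta_j = B^T (e_j delta^T + delta e_j^T) B
        dw = [[(delta[c] if r == j else Fraction(0)) +
               (delta[r] if c == j else Fraction(0)) for c in range(m)]
              for r in range(m)]
        derivs.append(matmul(matmul(bt, dw), b))
    for i, j in edges:  # d Sigma / d lambda_ij = Sigma E_ij B + (Sigma E_ij B)^T
        t = matmul(matmul(sigma, unit(m, i, j)), b)
        derivs.append(add(t, transpose(t)))
    # Vectorize each symmetric derivative by its upper triangle.
    rows = [[d[r][c] for r in range(m) for c in range(r, m)] for d in derivs]
    return exact_rank(rows), len(rows)
```

If the returned rank equals the number of parameters at a single point, the Jacobian has full rank generically, since a nonvanishing minor stays nonzero outside a proper algebraic subset. A rank deficit at one random point is strong evidence of generic deficiency but, unless the parameter count exceeds m(m+1)/2, it is not a proof.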