How to take advantage of CorReg?

Clément Théry

26 April 2017

CorReg’s Concept

This package was motivated by correlation issues in real datasets, in particular industrial datasets.

The main idea is to model the correlations between covariates explicitly through a structure of sub-regressions, that is, a system of linear regressions between the covariates. Because each sub-regression can involve several covariates, the structure captures complex links and not only correlations between pairs of variables. It identifies redundant covariates that can be removed in a pre-selection step, which improves matrix conditioning without significant loss of information. The pre-selection is also easy to explain, because it follows directly from the structure of sub-regressions, which is itself easy to interpret. The package provides an algorithm to find the sub-regression structure inherent to the dataset; it is based on a full generative model and uses a Markov chain Monte Carlo (MCMC) method. This pre-treatment does not depend on a response variable, so it can be applied more generally to any correlated dataset.
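To make the notion of a sub-regression concrete, here is a minimal base R sketch on simulated data. The variable names (x1, x2, x3), the coefficients, and the noise level are assumptions chosen for illustration; the data do not come from CorReg's own generator and the code does not call the package.

```r
# Illustrative sketch in base R (simulated data, no CorReg call).
set.seed(1)
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
# Hypothetical sub-regression: x3 is almost a linear combination of x1 and x2.
x3 <- 2 * x1 - x2 + rnorm(n, sd = 0.01)
X <- cbind(x1, x2, x3)

# The sub-regression itself is an ordinary linear regression between covariates.
sub_fit <- lm(x3 ~ x1 + x2)
r2_sub <- summary(sub_fit)$r.squared

# Condition number of the correlation matrix, with and without the redundant covariate.
kappa_full <- kappa(cor(X), exact = TRUE)
kappa_reduced <- kappa(cor(X[, c("x1", "x2")]), exact = TRUE)

print(round(coef(sub_fit), 2))
print(c(R2 = r2_sub, kappa_full = kappa_full, kappa_reduced = kappa_reduced))
```

Dropping x3 loses almost no information, since the sub-regression explains it nearly perfectly, while the condition number of the remaining covariates falls sharply.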

In a second step, a plug-in estimator reintroduces the redundant covariates sequentially. All the covariates are then used, but the sequential approach acts as a protection against correlations.
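One possible reading of this sequential idea is sketched below in base R. It is an assumption about the general mechanism, not the package's implementation: the data, the single redundant covariate x3, and the use of plain least squares at every step are all choices made for illustration.

```r
# Illustrative sketch only: a sequential "plug-in" fit with plain least squares.
set.seed(3)
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1 + x2 + rnorm(n, sd = 0.3)          # hypothetical redundant covariate
y <- 1 + 2 * x1 - x2 + 0.5 * x3 + rnorm(n)  # response uses all three covariates

# Step 1: marginal regression of y on the non-redundant covariates only.
fit_marg <- lm(y ~ x1 + x2)
# Step 2: sub-regression of the redundant covariate on the others.
fit_sub <- lm(x3 ~ x1 + x2)
# Step 3: regress what is left of y on what is left of x3.
y_left <- resid(fit_marg)
x3_left <- resid(fit_sub)
beta_r <- unname(coef(lm(y_left ~ x3_left - 1)))
# Step 4: plug the estimate back in to correct the marginal coefficients.
beta_f <- coef(fit_marg) - beta_r * coef(fit_sub)
plug_in <- c(beta_f, x3 = beta_r)

fit_full <- lm(y ~ x1 + x2 + x3)
print(rbind(plug_in = plug_in, ols = coef(fit_full)))
```

With plain least squares at every step, the sequential estimate coincides with the full OLS fit (this is the Frisch-Waugh-Lovell theorem). Any protection against correlations would therefore presumably come from applying variable selection or shrinkage within the steps, which this sketch does not do.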

This package also contains some functions that make everyday statistics easier (see BoxPlot, recursive_tree, or matplot_zone).

In this vignette we explain the main method in CorReg, which leads to the correg function. This function fits a linear regression using the sub-regression structure and/or one of many variable selection methods (including lasso, ridge, clere, stepwise, …).

Dataset generation

We first generate (for this tutorial) a dataset with strong correlations between the covariates. The mixture_generator function returns such a dataset, along with a validation sample built with the same parameters. Both contain a response variable generated by linear regression on some (not all) of the covariates. The covariates are linked by sub-regressions that make some variables redundant once the other variables are known. Such correlations make the variance of the regression estimators explode. Moreover, the dimension is unnecessarily high and interpreting the regression coefficients becomes unreliable.
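The variance problem can be seen on a small simulated example. The sketch below uses base R only and hypothetical variables; it does not use mixture_generator, and the coefficients and noise levels are assumptions chosen to make the effect visible.

```r
# Illustrative sketch in base R: a redundant covariate inflates standard errors.
set.seed(42)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1 + x2 + rnorm(n, sd = 0.05)  # hypothetical sub-regression on x1 and x2
y <- 1 + 2 * x1 - x2 + rnorm(n)      # response depends on x1 and x2 only

fit_full <- lm(y ~ x1 + x2 + x3)     # keeps the redundant covariate
fit_reduced <- lm(y ~ x1 + x2)       # redundant covariate removed

se_full <- coef(summary(fit_full))[, "Std. Error"]
se_reduced <- coef(summary(fit_reduced))[, "Std. Error"]

print(round(se_full, 3))
print(round(se_reduced, 3))
```

The standard errors of the x1 and x2 coefficients are much larger in the full model, even though x3 brings almost no additional information about y.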

We will try to find this structure (in real life we don’t know the true structure).

CorReg’s Method

Density estimation

To find the correlation structure we need a global density model that serves as a null hypothesis. Each variable is either independent (following a density that we have to estimate) or linearly dependent on other covariates. The density_estimation function provides this null hypothesis for each variable. We use Gaussian mixtures because they can fit the wide range of densities found in real data.

# density estimation for the MCMC (with Gaussian mixtures)
# X_appr: covariates of the learning sample from the dataset generation step
density = density_estimation(X = X_appr, nbclustmax = 10, detailed = TRUE, package = "Rmixmod")
Bic_null_vect = density$BIC_vect  # vector of the BIC values found (one value per covariate)

Each null hypothesis is associated with a BIC value, whose complexity term is that of the corresponding mixture model.
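The role of these BIC values can be illustrated with a simplified base R sketch. It assumes a single Gaussian (a one-component mixture) as the null density instead of a full Gaussian mixture, uses simulated variables with hypothetical names, and does not call density_estimation or Rmixmod.

```r
# Illustrative sketch in base R: null-hypothesis BIC versus sub-regression BIC.
set.seed(7)
n <- 150
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1 - 2 * x2 + rnorm(n, sd = 0.2)  # hypothetical dependent covariate

# Null hypothesis for x3: independent and Gaussian (mean and variance to estimate).
null_fit <- lm(x3 ~ 1)
# Alternative: x3 is explained by a sub-regression on x1 and x2.
sub_fit <- lm(x3 ~ x1 + x2)

# BIC by hand: -2 * log-likelihood + (number of free parameters) * log(n).
k_null <- attr(logLik(null_fit), "df")
bic_null_manual <- -2 * as.numeric(logLik(null_fit)) + k_null * log(n)

bic_null <- BIC(null_fit)
bic_sub <- BIC(sub_fit)
print(c(k_null = k_null, bic_null = bic_null, bic_sub = bic_sub))
```

A lower BIC for the sub-regression than for the null density suggests that x3 is better described as dependent on the other covariates. With a real mixture, the complexity term would count the free parameters of all components rather than the two parameters of a single Gaussian.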