Given a list of several paths and a target name, automatically creates and cleans train and test datasets.
IMPORTANT: a dataset is treated as a test set if it does not contain the target value. Otherwise it is
treated as part of the train set.
Also determines the task and encodes the target (classification problems only).

Finally dumps the datasets to HDF5 and, if applicable, the target encoder.

Parameters:

Lpath (list, default = None) – List of str paths to load the data from

target_name (str, default = None) – The name of the target. Works for both classification
(multiclass or not) and regression.

Returns:

Dictionary containing:

'train' : pandas dataframe for train dataset

'test' : pandas dataframe for test dataset

'target' : encoded pandas Series for the target on train set (with dtype='float' for a regression or dtype='int' for a classification)
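
The train/test rule and the target encoding described above can be illustrated with a minimal sketch. It is not the library's implementation: plain dicts of columns stand in for dataframes, and the helper names `split_train_test` and `encode_target`, as well as the sample columns, are hypothetical.

```python
def split_train_test(tables, target_name):
    """Sort tables (dict: column name -> list of values) into train and test."""
    train, test = [], []
    for table in tables:
        # Rule from the text: a table without the target column is a test set.
        (train if target_name in table else test).append(table)
    return train, test


def encode_target(values):
    """Map class labels to integer codes 0..n_classes-1 (sorted label order)."""
    classes = sorted(set(values))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[v] for v in values], classes


# Hypothetical data: the first table has the target, the second does not.
tables = [
    {"age": [22, 35], "Survived": ["yes", "no"]},
    {"age": [41, 19]},
]
train, test = split_train_test(tables, "Survived")
encoded, classes = encode_target(train[0]["Survived"])
```

The encoding step only applies to classification; for a regression the target would be kept as floats.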

threshold (float, default = 0.6) – Drift threshold under which features are kept. Must be between 0. and 1.
The lower the threshold, the stricter the selection, so that only the most stable (non-drifting) variables are kept: a feature with
a drift measure of 0. is very stable and one with 1. is highly unstable.
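
As a rough illustration of how such a threshold acts, the sketch below filters features by a precomputed drift measure. The function name `keep_stable_features` and the drift scores are hypothetical, and the strict comparison (`<`) is an assumption about how "under" is interpreted.

```python
def keep_stable_features(drift_scores, threshold=0.6):
    """Return the names of features whose drift measure is under the threshold."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0. and 1.")
    # Assumption: a feature is kept only if its score is strictly below the threshold.
    return [name for name, score in drift_scores.items() if score < threshold]


# Hypothetical drift measures (0. = very stable, 1. = highly unstable).
drift_scores = {"age": 0.05, "signup_date": 0.93, "country": 0.58}

kept = keep_stable_features(drift_scores)                 # default threshold 0.6
kept_strict = keep_stable_features(drift_scores, 0.5)     # lower threshold, fewer features
```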

A stacking classifier is a classifier that uses the predictions of
several first layer estimators (generated with a cross validation method)
as inputs for a second layer estimator.

Parameters:

base_estimators (list, default = [Classifier(strategy="XGBoost"), Classifier(strategy="RandomForest"), Classifier(strategy="ExtraTrees")]) – List of estimators to fit in the first level using cross validation.

level_estimator (object, default = LogisticRegression()) – The estimator used in second and last level.

n_folds (int, default = 5) – Number of folds used to generate the meta features for the training set

copy (bool, default = False) – If true, meta features are added to the original dataset
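
The way cross validation produces meta features for the training set can be sketched as follows. This is a simplified illustration, not the library's code: the two toy estimators stand in for the real base estimators, the function name `out_of_fold_meta_features` is hypothetical, folds are taken without shuffling, and predicted labels are used as meta features instead of predicted probabilities.

```python
class MajorityClassifier:
    """Toy base estimator: always predicts the most frequent training label."""

    def fit(self, X, y):
        self.label_ = max(sorted(set(y)), key=list(y).count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class NearestCentroidClassifier:
    """Toy base estimator: predicts the label of the closest class centroid."""

    def fit(self, X, y):
        self.centroids_ = {}
        for label in sorted(set(y)):
            rows = [x for x, t in zip(X, y) if t == label]
            self.centroids_[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict(self, X):
        def sq_dist(a, b):
            return sum((u - v) ** 2 for u, v in zip(a, b))

        return [min(self.centroids_, key=lambda c: sq_dist(x, self.centroids_[c]))
                for x in X]


def out_of_fold_meta_features(base_estimators, X, y, n_folds=5, copy=False):
    """Predict each row with estimators that were fitted without that row."""
    n = len(X)
    # One meta feature column per base estimator, filled fold by fold.
    meta = [[None] * len(base_estimators) for _ in range(n)]
    for k in range(n_folds):
        held_out = list(range(k, n, n_folds))  # every n_folds-th row
        held_out_set = set(held_out)
        fit_rows = [i for i in range(n) if i not in held_out_set]
        for j, estimator in enumerate(base_estimators):
            estimator.fit([X[i] for i in fit_rows], [y[i] for i in fit_rows])
            preds = estimator.predict([X[i] for i in held_out])
            for i, p in zip(held_out, preds):
                meta[i][j] = p
    # copy=True appends the meta features to the original columns.
    return [list(x) + m for x, m in zip(X, meta)] if copy else meta


# Hypothetical one-feature dataset with two well separated classes.
X = [[0.0], [0.2], [0.1], [0.3], [0.4], [5.0], [5.2], [5.1], [5.3], [5.4]]
y = [0] * 5 + [1] * 5
base = [MajorityClassifier(), NearestCentroidClassifier()]

meta = out_of_fold_meta_features(base, X, y, n_folds=5)
with_original = out_of_fold_meta_features(base, X, y, n_folds=5, copy=True)

# Second layer: any estimator fitted on the meta features (toy stand-in here).
level_estimator = NearestCentroidClassifier().fit(meta, y)
stacked_predictions = level_estimator.predict(meta)
```

Because every row is predicted by estimators that never saw it during fitting, the second layer is trained on predictions that resemble what it will receive for unseen data.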

A stacking regressor is a regressor that uses the predictions of
several first layer estimators (generated with a cross validation method)
as inputs for a second layer estimator.

Parameters:

base_estimators (list, default = [Regressor(strategy="XGBoost"), Regressor(strategy="RandomForest"), Regressor(strategy="ExtraTrees")]) – List of estimators to fit in the first level using cross validation.

level_estimator (object, default = LinearRegression()) – The estimator used in second and last level

n_folds (int, default = 5) – Number of folds used to generate the meta features for the training set

copy (bool, default = False) – If true, meta features are added to the original dataset

random_state (None, int or RandomState, default = 1) – Pseudo-random number generator state used for shuffling.
If None, use default numpy RNG for shuffling.
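
The effect of a fixed random state on fold shuffling can be sketched as below. This is only an illustration: it uses the standard library `random` module as a stand-in for the numpy RNG mentioned above, and the function name `shuffled_folds` is hypothetical.

```python
import random


def shuffled_folds(n_samples, n_folds=5, random_state=1):
    """Shuffle row indices with a seeded generator and deal them into folds."""
    # random_state=None gives an unseeded generator, so folds differ between runs.
    rng = random.Random(random_state)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    return [sorted(indices[k::n_folds]) for k in range(n_folds)]


# The same seed reproduces exactly the same folds.
folds_a = shuffled_folds(10, n_folds=5, random_state=1)
folds_b = shuffled_folds(10, n_folds=5, random_state=1)
```

Fixing the seed makes the generated meta features reproducible from one run to the next.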

Evaluates the data with a given scoring function and given hyper-parameters
of the whole pipeline. If no parameters are set, the default configuration for
each step is evaluated: no feature selection is applied and no meta features are
created.