Porto Seguro’s Safe Driver Prediction (Kaggle)
This competition was held on Kaggle from August to November 2017. Porto Seguro is a large Brazilian insurance company that wishes to build a model predicting the probability that a driver will initiate an auto insurance claim in the next year.

The data

The training data is an anonymized 113 MB .csv file with 59 features and ~595k samples, and the data for submitting predictions has around 892k samples.

The feature engineering will be limited here, because we have almost no information about the features beyond their type (numerical, categorical or binary). I could try to randomly combine or transform features, but doing so relies on luck and I do not want to waste time on it.

I have written a class that:

Analyses the data to automatically determine the type of the features

Encodes missing values with a different strategy depending on the type: the mean for numeric features, the most frequent value for categorical or binary features.
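The original class is not reproduced in this export; a minimal sketch of the idea, assuming the Porto Seguro conventions (missing values encoded as -1, categorical and binary features suffixed "_cat" and "_bin"), could look like this:

import numpy as np
import pandas as pd

class FeatureCleaner:
    #Hypothetical re-implementation: detect the feature type from the column suffix
    #and impute missing values accordingly
    def fit_transform(self, df):
        df = df.replace(-1, np.nan)  #missing values are encoded as -1 in this data set
        for col in df.columns:
            if col.endswith("_cat") or col.endswith("_bin"):
                df[col] = df[col].fillna(df[col].mode()[0])  #most frequent value
            else:
                df[col] = df[col].fillna(df[col].mean())  #mean for numeric features
        return df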

Note that the data is imbalanced, since only about 3.6% of the targets are “1”. I’ll deal with this by using built-in parameters such as “class_weight” or by manually oversampling the data.

Feature selection

First, I have deleted the features having too many null values:

data.drop(["ps_car_03_cat", "ps_car_05_cat"], inplace=True, axis=1)

Following several discussions on Kaggle’s forum, all features labeled “*_calc_*” are deleted. We suppose these were preprocessed by Porto Seguro’s data scientists and are therefore useless in an anonymized data set. Most models shared on Kaggle got rid of them and showed an increase in the final score.

So you can tune the number of features by changing the threshold on line 10.
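The selection code itself is not reproduced in this export; as an illustration, a tree-based selection with a tunable threshold (the estimator, the "0.5*mean" value and the X_train/y_train names are assumptions) could look like:

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

#Fit a tree classifier and keep only the features above the importance threshold
trees = ExtraTreesClassifier(n_estimators=200, random_state=0)
trees.fit(X_train, y_train)
selector = SelectFromModel(trees, threshold="0.5*mean", prefit=True)  #tune this threshold
X_selected = selector.transform(X_train)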

Logistic regression

This linear model is one of the simplest models we can use for binary classification. I used it in the first steps of the competition to test, for example, the sample selection threshold and other parameters.

One limitation is that the data must first be binarized, leading to more than 230 features. To speed up the calculation, I have used PCA to reduce the number of features, through a function that pipelines PCA and then logistic regression.
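The function is not reproduced in this export; a minimal sketch of such a pipeline (the name and default values are assumptions) might be:

from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def make_pca_logreg(n_components=140, class_weight={0: 1, 1: 12}):
    #PCA to reduce the binarized features, then a weighted logistic regression
    return Pipeline([("pca", PCA(n_components=n_components)),
                     ("logreg", LogisticRegression(class_weight=class_weight))])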

The “class_weight” parameter can be tuned to take the data imbalance into account. I have adjusted the parameters using a randomized search rather than a grid search. To achieve the highest Gini score, the optimal number of components for the PCA lies in the 140-180 range (depending on the parameters and features selected), and the class weights should be adjusted so that the percentage of positive predictions is slightly over 4%.
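As an illustration, a randomized search over the number of components and the class weights could be set up as below; the ranges and the X_train_bin name are assumptions, and "roc_auc" is used as a scoring proxy since the normalized Gini is simply 2*AUC - 1.

from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV

param_dist = {"pca__n_components": randint(100, 200),
              "logreg__class_weight": [{0: 1, 1: w} for w in (8, 10, 12, 15)]}
search = RandomizedSearchCV(make_pca_logreg(), param_dist, n_iter=20, scoring="roc_auc", cv=3)
search.fit(X_train_bin, y_train)  #X_train_bin: the binarized training data
print(search.best_params_)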

The graphs below show the train and test scores as a function of the number of components, as well as the fitting time. The test score clearly reaches its optimum around 140 components. The importance of the PCA is also visible in the fitting time, which increases exponentially above ~180 components.

Gradient boosting classifier

For gradient boosting classification I have used XGBoost’s powerful scikit-learn API. The data is binarized beforehand and the features are selected using a tree classifier, as described in the previous section. This classifier has many parameters, so a random search is necessary to fine-tune the model.
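As a sketch of this step (the search space below is an assumption, not the one actually used in the competition; again, "roc_auc" is a monotonic proxy for the Gini score):

from xgboost import XGBClassifier
from sklearn.model_selection import RandomizedSearchCV

xgb = XGBClassifier(n_estimators=300, objective="binary:logistic")
param_dist = {"max_depth": [3, 4, 5, 6],
              "learning_rate": [0.01, 0.05, 0.1],
              "subsample": [0.6, 0.8, 1.0],
              "colsample_bytree": [0.6, 0.8, 1.0],
              "scale_pos_weight": [1, 5, 10, 26]}  #~26 matches the 3.6% positive rate
search = RandomizedSearchCV(xgb, param_dist, n_iter=30, scoring="roc_auc", cv=3)
search.fit(X_selected, y_train)  #X_selected: the binarized, tree-selected features
print(search.best_params_, search.best_score_)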

Neural network model

Since the DNNClassifier does not have a class_weight parameter, I have written a simple function, “classbalance”, that oversamples the under-represented class to the desired level.
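The original function is not reproduced here; a minimal version (the signature and the default ratio are assumptions) could be:

import pandas as pd

def classbalance(df, target="target", ratio=0.25):
    #Oversample the minority class ("1") with replacement until it represents
    #`ratio` of the resulting data set
    minority = df[df[target] == 1]
    majority = df[df[target] == 0]
    n_needed = int(ratio * len(majority) / (1 - ratio))
    oversampled = minority.sample(n=n_needed, replace=True, random_state=0)
    return pd.concat([majority, oversampled]).sample(frac=1, random_state=0)  #shuffle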

The number of layers and nodes is defined on line 79. Then, from line 97, a little work is needed to get the predictions, since TensorFlow outputs them as a list of dictionaries. So I enumerate the predictions to extract only the predicted classes and the associated probabilities.
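That code is not reproduced here, but the idea, using the tf.estimator API (variable names are illustrative), is roughly:

#predict() returns a generator of dictionaries, one per sample
predictions = list(classifier.predict(input_fn=pred_input_fn))
pred_classes = [int(p["class_ids"][0]) for p in predictions]
pred_probas = [p["probabilities"][1] for p in predictions]  #probability of class "1"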

Final predictions and score

The best score I could achieve was obtained by averaging the predictions of the TensorFlow and XGBoost models.
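Concretely, this is a simple average of the predicted probabilities of the two models before writing the submission file (variable names are illustrative):

import numpy as np

#Average the probabilities predicted by the two models
final_proba = (np.array(pred_probas) + np.array(xgb_probas)) / 2.0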

My best score was a Gini of 0.28054 (~top 50%).

I suppose I could try different weightings of these models to improve the score and, of course, re-train all the models using cross-validation, but my goal here was to try and test different models, not necessarily to get the highest possible rank.

TensorFlow classification example: Titanic competition

My training data contains 891 samples and 16 features, of which I’ll be using only 5, as in the previous article. We could split the data into train/test sets, but here I’ll use all of the data for training.

Let’s move on to building a neural network (NN). TensorFlow has several high-level functions we can use: for instance, let’s build an NN with n hidden units and use a ProximalGradientDescentOptimizer instead of the default Adagrad optimizer (this is actually not the best option, but take it as an illustration).
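A minimal sketch with the tf.estimator API (TensorFlow 1.x; the layer sizes and learning rate are illustrative):

import tensorflow as tf

feature_columns = [tf.feature_column.numeric_column("x", shape=[5])]  #5 features
classifier = tf.estimator.DNNClassifier(
    feature_columns=feature_columns,
    hidden_units=[20, 20],  #two hidden layers of 20 units
    n_classes=2,
    optimizer=tf.train.ProximalGradientDescentOptimizer(learning_rate=0.05))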

To build the training input, we use tf.estimator.inputs.numpy_input_fn, which returns an input function that feeds a dict of numpy arrays into the model. Pay attention to the shape of “y”, the target array.
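For example (X and y being numpy arrays; depending on the TensorFlow version, y may need to be reshaped into a column vector):

import numpy as np

train_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"x": np.array(X, dtype=np.float32)},  #shape (891, 5)
    y=np.array(y, dtype=np.int32),           #shape (891,), or (891, 1) if required
    batch_size=128,
    num_epochs=None,  #cycle over the data indefinitely
    shuffle=True)
classifier.train(input_fn=train_input_fn, steps=2000)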

Setup Jupyter notebook default folder in Anaconda
I had to perform a clean install of my computer and struggled to get Jupyter properly configured. The default folder is located on C:\, but all my files are in my cloud folder on another hard drive.

There is no way to easily change the default folder from Anaconda, so here’s how to proceed:

First, launch the command prompt from Anaconda by clicking on “Open Terminal” (from any environment).

Then enter the following command: jupyter notebook --generate-config, and wait a few seconds.

This will create a file named “jupyter_notebook_config.py” in your user folder. Open this file and search for #c.NotebookApp.notebook_dir = ''

Do not forget to remove the # symbol to uncomment the line, and insert the path to your notebooks folder using forward slashes, e.g. D:/MyFiles/Notebooks/Python
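For example, with the path above, the edited line becomes:

c.NotebookApp.notebook_dir = 'D:/MyFiles/Notebooks/Python'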

If you’re interested in classification, have a look at this great tutorial on Analytics Vidhya.

I have decided to test the MLBox library on Kaggle’s “House Prices” competition, to compare with my last solution.

Downloading and installing MLBox

Good news: MLBox is available for Linux, but not only, since I could install it on my Windows 10 system using Anaconda Navigator. However, there is no conda package, so we’re going to run into some difficulties (I hope I’ll save you a few hours).

Create an MLBox environment in Anaconda; Python 3.6 is fine. Simply click on ‘Create’ (bottom left), select Python 3.6 and click ‘Create’ again; then wait for Anaconda to download the packages and install everything.

Creating a new environment in Anaconda
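If you prefer the command line, a rough equivalent (the environment name is arbitrary) is:

conda create -n mlbox python=3.6
activate mlbox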

Before trying to install MLBox, you’ll have to install XGBoost.

XGBoost cannot be installed via pip at the moment, so if you try to install MLBox directly, the installation process will crash.

You can choose Git or MinGW if you feel comfortable with it, but I used a simpler method. Go to this page and download the .whl file corresponding to your system. Next, open a command prompt from Anaconda in your newly created environment.

Open terminal from here

Then cd to the folder where you downloaded the .whl file, and from there simply run pip install xgboost-0.6-cp36-cp36m-win_amd64.whl (or whatever file name you downloaded).

Now you should be able to proceed by simply typing

pip install mlbox

If the installation stumbles on a blocked dependency, you may have to download the corresponding .whl file and pip install it before retrying to install mlbox.

OK, we are now ready to work!

Using MLBox for regression

1. MLBox “Blackbox” edition

MLBox can be used as a complete black box: you feed it the train/test sets, define the target, and you’re done.

Let’s try the basic blackbox approach.

#Import MLBox and other packages
import mlbox as mlb #I don't really like * imports
#Read the files with preprocessing.Reader
#Usage: train_test_split([path to training data, path to test data], target)
#The target is "SalePrice", i.e. the price the houses were sold for
data = mlb.preprocessing.Reader(sep=",").train_test_split(["data.csv", "data_test.csv"], 'SalePrice')
#Preprocess the data:
#1/ Remove the Ids
#2/ Delete features drifting between the train and test sets
data = mlb.preprocessing.Drift_thresholder().fit_transform(data)
#Evaluate the default pipeline with cross-validation (None = default parameters)
mlb.optimisation.Optimiser().evaluate(None, data)
#Predict on the test data, still with the default parameters
mlb.prediction.Predictor().fit_predict(None, data)

That was easy. We now have a subfolder named “save” in which we can find a .csv file with the predictions, as well as the feature importances and drift coefficients for all variables.

Let’s submit the prediction to Kaggle … wait for the data to be uploaded and processed … and the score is …

0.26013

Hmm, that’s not too bad considering I simply ran a library on data I had barely examined. However, with such a score I’d rank 1590th out of 1765.

There’s hopefully something we can do about it.

2. Optimising the pipeline

So, we’ve seen that MLBox can deliver usable predictions without any work, but we can do much better by optimising the model parameters. Similarly to GridSearchCV in scikit-learn, we feed the model a dictionary containing key/value pairs of parameters.
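The original listing is not reproduced in this export; as a sketch, the optimisation step can look like the code below. The search space is an assumption following MLBox’s “step__parameter” naming convention; only the scoring and max_evals choices come from the notes that follow.

#Hypothetical search space for the hyperparameter optimisation
space = {"ne__numerical_strategy": {"search": "choice", "space": [0, "mean"]},
         "ce__strategy": {"search": "choice", "space": ["label_encoding", "random_projection"]},
         "fs__strategy": {"search": "choice", "space": ["variance"]},
         "fs__threshold": {"search": "uniform", "space": [0.05, 0.3]},
         "est__max_depth": {"search": "choice", "space": [4, 5, 6, 7]},
         "est__learning_rate": {"search": "uniform", "space": [0.01, 0.1]}}
opt = mlb.optimisation.Optimiser(scoring="r2", n_folds=5)
best = opt.optimise(space, data, max_evals=120)
mlb.prediction.Predictor().fit_predict(best, data)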

Note that for the scoring I had to select “r2” instead of “mean_squared_error”, because of an annoying deprecation warning: “mean_squared_error” has been replaced by “neg_mean_squared_error” in sklearn 0.18 (I’ll try to report this ‘bug’).

I have set max_evals=120 instead of the default value (40); it seems to yield better results, but I suggest you keep it at 40 for your tests, since it increases the optimisation time a lot (2600 seconds for the code above).

Missing numerical values are set to 0, while categorical ones are set to the string ‘None’; about 16% of the features are removed, using variance-based selection. The numeric parameters chosen for the estimator are also shown.

In the subfolder ‘save’ we also have a bar graph of the feature importances:

Features importance as determined by MLBox

From a business point of view, this graph makes sense, since the most important features include the area of the house, its overall quality, the garage area, the garden area, the year the house was built, the quality of the neighborhood, etc.

Fortunately, as shown in my previous post, my best score is 0.12594, which ranked me 661st out of 1765. I’m still 7 places ahead of the machine, but probably not for long, as I’ll keep improving the optimisation step.

3. Being smarter than MLBox?

Considering I had spent several days building a model manually, I was equally frustrated and excited to see that a few lines of code could provide a similar result. I decided to check whether I could do better by preprocessing the data manually, as I did in my previous post.

I went back to the data set and decided to remove some features that I considered irrelevant. I tried several combinations with ["GarageYrBlt", "MoSold", "MasVnrArea", "GarageCars", "GarageArea"]. In particular, looking at the (reduced) correlation map:

Correlation map for selected features

There seems to be strong collinearity between “GarageCars” (the number of cars the garage can hold) and “GarageArea”, meaning we have some redundancy there. Intuitively, I would try to delete one of these features (or create a new one by combining them).
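One way to test this, before feeding the files to MLBox’s Reader, is simply the following (the “_reduced” file names are arbitrary):

import pandas as pd

#Drop "GarageArea" and keep "GarageCars", its collinear counterpart
for f in ["data.csv", "data_test.csv"]:
    df = pd.read_csv(f)
    df.drop(["GarageArea"], axis=1, inplace=True)
    df.to_csv(f.replace(".csv", "_reduced.csv"), index=False)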

Well, I confess that every combination I tried led to scores that were still good, but worse: 0.12672, 0.12827, 0.12852…

Whatever I try, the algorithm always performs better by itself: pretty impressive! I suppose MLBox is smart enough to deal with multicollinearity on its own. I feel like it’s telling me: “Don’t even try, I don’t need you”.

Conclusion

First, I hope this tutorial will help you get started with MLBox, which I think is a wonderful tool for machine learning.

I hope I’ll have enough time to try and improve the model for this House price problem. I already have some ideas I’d like to implement, so stay tuned for a next blog post.

Concerning the library itself, it is:

Easy to use

Fast

Able to yield interesting results with very little knowledge of the underlying models

Able to yield much better results with some fine-tuning, so do not expect it to solve your problems too easily

Do not forget to keep an eye on the subfolder /save/joblib, which can grow rapidly: mine grew to 6 GB in just one day (not really good for my online drive and limited bandwidth)

As a side note, I’d like to point out that I was not able to get any good predictions with LightGBM (nothing better than 0.26), whatever parameters I tried. If you have any idea why, please comment below.