A banking institution ran a direct marketing campaign based on phone calls. Oftentimes, more than
one contact to the same client was required in order to determine whether the product (a bank term
deposit) would be subscribed. Our task is therefore to predict whether a client will subscribe to
the term deposit based on the given information.

Our original approach was to build a simple poor man's stacking ensemble out of relatively simple models. We first tried stacking NaiveBayes, RandomForest, ExtraTreeClassifier, SVM, and logistic regression together. These models performed well in cross-validation and on the test set, with ROC-AUC scores around 0.78-0.8; however, when submitting to Kaggle the score dropped to 0.76. Gaussian Naive Bayes was one of the strongest individual performers, though it did not add anything to the ensemble methods. As an aside, we also tried NearestCentroid and KNN, but dropped them because of issues with consistent predict_proba calls.
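The poor man's stacking described above can be sketched with scikit-learn's VotingClassifier in soft-voting mode; this is a minimal illustration on synthetic data (the synthetic dataset, model list, and hyperparameters are stand-ins, not the tuned models used in the notebook):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic, imbalanced stand-in for the bank data (illustration only)
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

# "Poor man's stacking": average the base models' predicted probabilities
vote = VotingClassifier(
    estimators=[
        ("nb", GaussianNB()),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",  # soft voting is required to score with ROC-AUC
)
scores = cross_val_score(vote, X, y, cv=5, scoring="roc_auc")
print("mean CV ROC-AUC:", round(scores.mean(), 3))
```

Soft voting averages each model's predict_proba output, which is why models without a reliable predict_proba (such as NearestCentroid) had to be dropped.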

*Note that all model hyperparameters were tuned using GridSearch with cv=5. Not all grid searches are included, to keep the notebook's execution time manageable.
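The tuning pattern used throughout looks roughly like the following sketch; the estimator and the grid values here are illustrative, not the actual grids searched in the notebook:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, random_state=0)

# 5-fold grid search scored on ROC-AUC, as used for every model
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=5,
    scoring="roc_auc",
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```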

Moving forward, we decided to drop the SVM altogether due to time constraints and to drop NaiveBayes since it was performing strangely. We were then left with gradient boosting, AdaBoost, easy ensembles, random forests, and logistic regression with feature selection. The ExtraTree classifier was left out since the random forests are more powerful and pick up on similar trends. AdaBoost overfit too much and dominated the stacking ensemble. Gradient boosting also overfit quite a lot, but there seemed to be an improvement in the Kaggle scores when using this model. Cross-validation and the test set showed ROC-AUCs around 0.8-0.82 for the ensemble methods that included the gradient-boosted random forest. This left us with gradient-boosted random forests and different implementations of logistic regression, such as AdaBoost, easy ensemble, and other sampling techniques.

The feature selection showed little improvement when applied to all of the data prior to the voting classifier for poor-man's stacking, which we attribute to the tree models in the ensemble selecting features implicitly. Instead, applying an RFE(RandomForest()) selector before logistic regression alone seemed to perform the best.
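The RFE-before-logistic-regression setup can be expressed as a pipeline; this is a minimal sketch on synthetic data, with the number of selected features chosen arbitrarily for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, random_state=0)

# RFE recursively drops the features the random forest ranks least
# important, then logistic regression is fit on the surviving subset
pipe = Pipeline([
    ("select", RFE(RandomForestClassifier(n_estimators=50, random_state=0),
                   n_features_to_select=10)),
    ("lr", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)
print("features kept:", pipe.named_steps["select"].n_features_)
```

Wrapping the selector and the model in one pipeline keeps the selection inside each cross-validation fold, avoiding leakage from the test folds.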

Standard scaling and min-max scaling performed similarly, with standard scaling having a slight edge. This was contrary to our expectation, since the binary dummy variables lie between 0 and 1, and the standard scaler shifts the 0's to negative values.
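The point about the dummy variables can be seen on a tiny hand-made example (the column values here are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# First column: a 0/1 dummy variable; second column: a continuous predictor
X = np.array([[0.0, 21.0],
              [1.0, 35.0],
              [0.0, 58.0],
              [1.0, 44.0]])

mm = MinMaxScaler().fit_transform(X)     # dummy column stays in {0, 1}
std = StandardScaler().fit_transform(X)  # dummy 0's become negative values
print("min-max dummy column:", mm[:, 0])
print("standard dummy column:", std[:, 0])
```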

Again, contrary to our expectation, both SMOTE and omitting any technique to deal with the imbalance performed much better than using RandomUnderSampler. Using SMOTE(ratio=0.5) followed by RandomUnderSampler gave a compromise between the performance of SMOTE alone and RandomUnderSampler alone, though it showed no improvement over SMOTE alone. This was gauged with respect to logistic regression and the poor-man's stacking classifier.

Another approach we tried was building an easy ensemble out of the voting classifiers. It overfit, though not as badly as AdaBoost, and the results seemed to stay consistent regardless of the number of classifiers used. A further analysis of the effect of the number of classifiers on the easy ensemble is included at the very end, in the analysis of resampling techniques.

categorical = ['job', 'marital_status', 'education', 'credit_default', 'housing',
               'loan', 'contact', 'month', 'day_of_week', 'prev_outcomes']  # Removed duration
continuous = ['age', 'campaign', 'prev_days', 'prev_contacts', 'nr_employed',
              'emp_var_rate', 'cons_price_idx', 'cons_conf_idx', 'euribor3m']

print("Total number of categorical predictors:", len(data[categorical].columns))
print("All categorical data as object:", (data[categorical].dtypes == 'object').all(), '\n')
print("Total number of continuous predictors:", len(data[continuous].columns))
print("All continuous data as float64 or int64:",
      data[continuous].dtypes.isin(['float64', 'int64']).all())

Total number of categorical predictors: 10
All categorical data as object: True
Total number of continuous predictors: 9
All continuous data as float64 or int64: True

Since our goal is to predict whether someone will subscribe to the term deposit based on the given information, we define the subscribed variable as our response variable.

In [5]:

data.subscribed.value_counts()

Out[5]:

no 29238
yes 3712
Name: subscribed, dtype: int64

Note the class imbalance between no and yes here. We would like to recode no as 0 and yes as 1 as classification labels, so the response is easier to work with when modeling, but let's explore this more in the next step.

Also, it's good to see that there are no unknown values, so we don't need to drop any data points or rows.

Note below that we also drop the duration variable, since using it is prohibited in the assignment.

In [33]:

from sklearn.model_selection import train_test_split

subscribed = data.subscribed
data_ = data.drop(["duration", "subscribed"], axis=1)
X_train, X_test, y_train, y_test = train_test_split(
    data_, subscribed == "yes", random_state=0, stratify=subscribed)

print("Size for X_train:", X_train.shape)
print("Size for X_test:", X_test.shape)
print("Size for y_train:", y_train.shape)
print("Size for y_test:", y_test.shape)

In this step, we expect you to look into the data and try to understand it before modeling. This understanding may lead to some basic data preparation steps which are common across the two model sets required.

We also used density plots below to visualize the distribution of those who subscribed and those who did not subscribe (y-axis) for each continuous variable (x-axis). A Gaussian kernel density estimate is used to draw inferences about the population of those who subscribed vs. those who didn't.
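The plotting pattern for one such variable looks roughly like this sketch; the `age` values here are randomly generated stand-ins, not the actual bank data:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Illustrative stand-in: one continuous predictor split by the response
df = pd.DataFrame({
    "age": np.r_[rng.normal(40, 10, 300), rng.normal(45, 12, 60)],
    "subscribed": ["no"] * 300 + ["yes"] * 60,
})

# One Gaussian KDE curve per class, drawn on shared axes
ax = df.loc[df.subscribed == "no", "age"].plot.density(label="no")
df.loc[df.subscribed == "yes", "age"].plot.density(ax=ax, label="yes")
ax.set_xlabel("age")
ax.legend()
```

Overlaying the two class-conditional densities makes it easy to see which continuous predictors separate subscribers from non-subscribers.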