4 ways to implement feature selection in Python for machine learning

This article is an excerpt from a book written by Ankit Dixit titled Ensemble Machine Learning. This book serves as a beginner’s guide to combining powerful machine learning algorithms to build optimized models.

In this article, we look at different methods for selecting features from a dataset and discuss four types of feature selection algorithms, with their implementation in Python using the scikit-learn (sklearn) library:

Univariate selection

Recursive Feature Elimination (RFE)

Principal Component Analysis (PCA)

Choosing important features (feature importance)

The first three algorithms and their implementations are explained briefly. We then discuss choosing important features (feature importance) in detail, as it is a widely used technique in the data science community.

Univariate selection

Statistical tests can be used to select those features that have the strongest relationships with the output variable.

The scikit-learn library provides the SelectKBest class, which can be used with a suite of different statistical tests to select a specific number of features.

The following example uses the chi squared (chi^2) statistical test for non-negative features to select four of the best features from the Pima Indians onset of diabetes dataset:
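A minimal sketch of this SelectKBest setup is shown here. It assumes synthetic stand-in data with eight features in place of the Pima Indians file, so the scores it prints are illustrative only; the shift to non-negative values is there because chi2 rejects negative inputs.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic stand-in for the Pima data (assumption): 8 features, binary target.
X, y = make_classification(n_samples=500, n_features=8, n_informative=4,
                           n_redundant=0, random_state=0)

# chi2 requires non-negative inputs, so shift every column to start at 0.
X = X - X.min(axis=0)

# Score each feature against the target and keep the four best.
selector = SelectKBest(score_func=chi2, k=4)
X_selected = selector.fit_transform(X, y)

print(selector.scores_)        # one chi-squared score per feature
print(selector.get_support())  # boolean mask marking the 4 kept features
print(X_selected.shape)        # (500, 4)
```

The scores_ array lets you inspect how each feature ranked, while get_support() tells you which columns survived.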

Recursive Feature Elimination

RFE works by recursively removing attributes and building a model on the attributes that remain. It ranks attributes using the weights the fitted model assigns to them (its coefficients or feature importances) and repeatedly prunes the weakest ones, which identifies the attributes (and combinations of attributes) that contribute the most to predicting the target attribute. You can learn more about the RFE class in the scikit-learn documentation.

The following example uses RFE with the logistic regression algorithm to select the top three features. The choice of algorithm does not matter too much as long as it is skillful and consistent:

You can see that RFE chose the top three features as preg, mass, and pedi. These are marked True in the support_ array and given rank 1 in the ranking_ array.
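A minimal sketch of this RFE setup follows. It assumes synthetic stand-in data rather than the Pima file, so the selected columns are unnamed and will not correspond to preg, mass, and pedi; max_iter=1000 is also an assumption, added only so that the logistic regression converges.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption): 8 features, 3 of them informative.
X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)

# RFE refits the model repeatedly, dropping the weakest feature each round
# until only the requested number remains.
model = LogisticRegression(max_iter=1000)
rfe = RFE(estimator=model, n_features_to_select=3)
rfe.fit(X, y)

print("Num features:", rfe.n_features_)
print("Selected features:", rfe.support_)
print("Feature ranking:", rfe.ranking_)
```

Features with rank 1 are the ones that were kept; higher ranks record the order in which the others were eliminated, with the largest rank removed first.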

Principal Component Analysis

PCA uses linear algebra to transform the dataset into a compressed form. Generally, it is considered a data reduction technique. A property of PCA is that you can choose the number of dimensions or principal components in the transformed result.

In the following example, we use PCA and select three principal components:
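A minimal sketch of this PCA step is shown here, assuming synthetic stand-in data with eight features. Note that each principal component is a linear combination of all the original features, so the transformed columns no longer correspond to individual input attributes.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# Synthetic stand-in data (assumption): 500 rows, 8 features.
X, _ = make_classification(n_samples=500, n_features=8, random_state=0)

# Project the data onto its first three principal components.
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)

print("Explained variance ratio:", pca.explained_variance_ratio_)
print(pca.components_.shape)  # (3, 8): one row of weights per component
print(X_reduced.shape)        # (500, 3)
```

The explained_variance_ratio_ values show how much of the total variance each component retains, which helps when deciding how many components to keep.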

Choosing important features (feature importance)

Feature importance is the technique used to select features using a trained supervised classifier. When we train a classifier such as a decision tree, we evaluate each attribute to create splits; we can use this measure as a feature selector. Let’s understand it in detail.

Random forests are among the most popular machine learning methods thanks to their relatively good accuracy, robustness, and ease of use. They also provide two straightforward methods for feature selection—mean decrease impurity and mean decrease accuracy.

A random forest consists of a number of decision trees. Every node in a decision tree is a condition on a single feature, designed to split the dataset into two so that similar response values end up in the same set. The measure based on which the (locally) optimal condition is chosen is known as impurity. For classification, it is typically either the Gini impurity or information gain/entropy, and for regression trees, it is the variance. Thus, when training a tree, we can compute how much each feature decreases the weighted impurity in the tree. For a forest, the impurity decrease from each feature can be averaged across the trees, and the features are ranked according to this measure.
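To make the idea of a weighted impurity decrease concrete, here is a small hand-rolled sketch for the Gini case. The function names gini and impurity_decrease are illustrative, not part of sklearn, and the toy labels are made up for the example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children

parent = ["a", "a", "a", "b", "b", "b"]

# A split that separates the classes perfectly removes all impurity.
perfect = impurity_decrease(parent, ["a", "a", "a"], ["b", "b", "b"])
# A split that leaves both children mixed removes very little.
poor = impurity_decrease(parent, ["a", "b", "a"], ["b", "a", "b"])

print(perfect)  # 0.5
print(poor)     # much smaller than 0.5
```

A feature that produces splits like the first one accumulates a large impurity decrease across the tree, which is what ranks it highly.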

Let’s see how to do feature selection using a random forest classifier and evaluate the accuracy of the classifier before and after feature selection. We will use the Otto dataset, which is available for free from Kaggle (you will need to sign up to Kaggle to download it). You can download the training dataset, train.csv.zip, from https://www.kaggle.com/c/otto-group-product-classification-challenge/data and place the unzipped train.csv file in your working directory.

This dataset describes 93 obfuscated details of more than 61,000 products grouped into 9 product categories (for example, fashion, electronics, and so on). Input attributes are the counts of different events of some kind.

The goal is to make predictions for new products as an array of probabilities for each of the 9 categories, and models are evaluated using multiclass logarithmic loss (also called cross entropy).

Now it is time to load the dataset. We will load the train.csv file, which contains more than 61,000 training instances. We will use 50,000 instances for our example: 35,000 to train the classifier and 15,000 to test its performance:
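The split described above can be sketched as follows. This is an assumption-laden illustration: split_rows is a hypothetical helper, the shuffle is an assumption (the article does not say how the rows were chosen), and a small random array stands in for the contents of train.csv. The array layout, 94 attribute columns followed by one label column, matches the test[:, 0:94] and test[:, 94] slicing used later in the article.

```python
import numpy as np

def split_rows(data, n_total, n_train, seed=0):
    """Shuffle the rows, keep n_total of them, and split into train and test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))[:n_total]
    subset = data[idx]
    return subset[:n_train], subset[n_train:]

# Small stand-in array (assumption): 94 attribute columns + 1 label column.
# With the real file the call would be split_rows(data, 50000, 35000).
data = np.random.default_rng(1).integers(0, 10, size=(620, 95)).astype(float)
train, test = split_rows(data, n_total=500, n_train=350)

# Separate attributes from labels for the training part.
Xtrain, ytrain = train[:, 0:94], train[:, 94]

print(train.shape, test.shape)  # (350, 95) (150, 95)
```

Shuffling before splitting matters whenever the rows of a file are ordered, because otherwise the train and test parts can end up with different class mixes.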

Let’s take note of the data size here. With about 35,000 training instances and 94 attributes, our dataset is quite large. Let’s see:

Shape of the dataset (35000, 94)
Size of Data set before feature selection: 26.32 MB

As you can see, we have 35,000 rows and 94 columns in our dataset, which amounts to more than 26 MB of data.
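The reported size is consistent with a NumPy array of 64-bit floats, which is an assumption about how the data was stored. The sketch below uses an array of zeros with the same shape and the same nbytes calculation that appears later in the article:

```python
import numpy as np

# Stand-in array with the reported shape; float64 uses 8 bytes per value.
Xtrain = np.zeros((35000, 94), dtype=np.float64)

print("Shape of the dataset ", Xtrain.shape)
# 35000 rows * 94 columns * 8 bytes = 26,320,000 bytes
print("Size of Data set before feature selection: %.2f MB" % (Xtrain.nbytes / 1e6))
```

Because nbytes depends only on the shape and the data type, halving the number of columns halves the memory footprint.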

In the next code block, we will configure our random forest classifier. We will use 250 trees with a maximum depth of 30, and 7 randomly chosen features will be considered at each split. The other hyperparameters keep their sklearn defaults:

# Select the test data for model evaluation
Xtest = test[:, 0:94]
ytest = test[:, 94]
# Create a random forest classifier with the following parameters
from sklearn.ensemble import RandomForestClassifier
trees = 250
max_feat = 7
max_depth = 30
min_sample = 2
clf = RandomForestClassifier(n_estimators=trees,
                             max_features=max_feat,
                             max_depth=max_depth,
                             min_samples_split=min_sample,
                             random_state=0,
                             n_jobs=-1)
# Train the classifier and measure the training time
import time
start = time.time()
clf.fit(Xtrain, ytrain)
end = time.time()
# Note down the model training time
print("Execution time for building the Tree is: %f" % (end - start))
pre = clf.predict(Xtest)

Let's see how much time is required to train the model on the training dataset:

Execution time for building the Tree is: 2.913641

# Evaluate the model performance on the test data
# (getAccuracy is a helper that is not shown in this excerpt)
acc = getAccuracy(pre, ytest)
print("Accuracy of model before feature selection is %.2f" % (100 * acc))

The accuracy of our model is:

Accuracy of model before feature selection is 98.82

As you can see, we get very good accuracy: almost 99% of the test data is classified into the correct categories, which is about 14,823 of the 15,000 instances.
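The getAccuracy helper is not shown in this excerpt. A hypothetical version, assuming it simply returns the fraction of predictions that match the true labels, could look like this:

```python
import numpy as np

def getAccuracy(pre, ytest):
    """Hypothetical helper: fraction of predictions equal to the true labels."""
    pre = np.asarray(pre)
    ytest = np.asarray(ytest)
    return float(np.mean(pre == ytest))

# Three of the four toy predictions are correct.
acc = getAccuracy([1, 2, 3, 3], [1, 2, 3, 4])
print("%.2f" % (100 * acc))  # 75.00

# The instance count quoted above follows from the reported accuracy.
print(round(0.9882 * 15000))  # 14823
```

If you prefer a library call, sklearn.metrics.accuracy_score computes the same quantity.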

So, should we go for further improvement? Well, why not? If we can improve further, we definitely should; here, we will use feature importance to select features. As you know, the tree building process uses an impurity measure to choose each node: the split that reduces impurity the most is chosen as the node in the tree. We can use a similar criterion for feature selection, giving more importance to features that reduce impurity more. These scores are available through the feature_importances_ attribute of a fitted sklearn classifier. Let’s find out the importance of each feature:

# Once we have trained the model, we will rank all the features
for feature in zip(feat_labels, clf.feature_importances_):
    print(feature)

As you can see here, each feature has a different importance based on its contribution to the final prediction.
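Printing the scores in column order makes it hard to see which features matter most. A minimal sketch of sorting them, assuming synthetic stand-in data and made-up labels in feat_labels, might look like this:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption): 20 features, 5 of them informative.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           n_redundant=0, shuffle=False, random_state=0)
feat_labels = ["feat_%d" % i for i in range(X.shape[1])]

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

# Sort feature indices from most to least important.
importances = clf.feature_importances_
order = np.argsort(importances)[::-1]

# Show the five highest-ranked features with their scores.
for i in order[:5]:
    print(feat_labels[i], round(importances[i], 4))
```

The scores are normalized to sum to 1, so a fixed threshold such as 0.01 means "at least 1% of the total importance".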

We will use these importance scores to rank our features. In the following part, we will select the features whose importance is greater than 0.01 and use them for model training:

# Select features which have a higher contribution to the final prediction
from sklearn.feature_selection import SelectFromModel
sfm = SelectFromModel(clf, threshold=0.01)
sfm.fit(Xtrain, ytrain)

In the next code block, we will transform the input dataset so that it keeps only the selected features. Then, we will check the size and shape of the new dataset:

# Transform the input dataset
import numpy as np
Xtrain_1 = sfm.transform(Xtrain)
Xtest_1 = sfm.transform(Xtest)
# Let's see the size and shape of the new dataset
print("Size of Data set after feature selection: %.2f MB" % (Xtrain_1.nbytes / 1e6))
shape = np.shape(Xtrain_1)
print("Shape of the dataset ", shape)

Size of Data set after feature selection: 5.60 MB
Shape of the dataset (35000, 20)

Do you see the shape of the dataset? We are left with only 20 features after the feature selection process, which reduces the size of the dataset from 26.32 MB to 5.60 MB. That’s a reduction of about 80% from the original dataset.

In the next code block, we will train a new random forest classifier with the same hyperparameters as earlier and test it on the testing dataset. Let’s see what accuracy we get after modifying the training set:

Can you see that? We have got 99.97 percent accuracy with the modified dataset, which means we are classifying 14,996 instances into the correct classes, while previously we were classifying only 14,823 instances correctly.
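The whole select-and-retrain workflow can be sketched end to end as follows. This is an illustration on synthetic stand-in data with fewer trees than the article uses, so it will not reproduce the figures quoted above, and accuracy_score stands in for the article's own accuracy helper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 50 features, 8 of them informative.
X, y = make_classification(n_samples=2000, n_features=50, n_informative=8,
                           n_redundant=0, random_state=0)
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.3,
                                                random_state=0)

# Same hyperparameters for both models so the comparison is fair.
params = dict(n_estimators=100, max_depth=30, random_state=0)

# Baseline: train on all features.
clf = RandomForestClassifier(**params).fit(Xtrain, ytrain)
acc_before = accuracy_score(ytest, clf.predict(Xtest))

# Keep only the features whose importance reaches the threshold.
sfm = SelectFromModel(clf, threshold=0.01)
sfm.fit(Xtrain, ytrain)
Xtrain_1 = sfm.transform(Xtrain)
Xtest_1 = sfm.transform(Xtest)

# Retrain on the reduced dataset and evaluate on the reduced test set.
clf_1 = RandomForestClassifier(**params).fit(Xtrain_1, ytrain)
acc_after = accuracy_score(ytest, clf_1.predict(Xtest_1))

print("Features kept: %d of %d" % (Xtrain_1.shape[1], Xtrain.shape[1]))
print("Accuracy before: %.2f, after: %.2f" % (100 * acc_before, 100 * acc_after))
```

Note that the selector is fitted on the training data only and then applied to the test data, so the test set plays no part in choosing the features.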

The feature selection process has given us a clear improvement; we can summarize all the results in the following table:

| Evaluation criteria | Before feature selection | After feature selection |
| --- | --- | --- |
| Number of features | 94 | 20 |
| Size of dataset | 26.32 MB | 5.60 MB |
| Training time | 2.91 seconds | 1.71 seconds |
| Accuracy | 98.82 percent | 99.97 percent |

The preceding table shows the practical advantages of feature selection. We have reduced the number of features significantly, which reduces both the model complexity and the dimensions of the dataset. Training takes less time after the reduction in dimensions, and the model reaches higher accuracy than before, which suggests that it overfits less.

To summarize, we explored four ways to perform feature selection in machine learning: univariate selection, RFE, PCA, and feature importance.

If you found this post useful, do check out the book Ensemble Machine Learning to learn more about stacked generalization, among other techniques.
