
Guide To Dimensionality Reduction With Recursive Feature Elimination


Nowadays, the data being generated is rich in information collected from various sources such as IoT devices, sensors, and social media. This makes the data high dimensional for particular problem statements. And we know that not every feature is equally important; some are simply irrelevant to the problem. High-dimensional data analysis is therefore a huge challenge for the data scientist, and here comes the role of feature selection or feature engineering.

As machine learning algorithms are highly prone to the curse of dimensionality, feature selection gives an effective way to overcome challenges like overfitting, poor learning accuracy, and high computational time, and it facilitates enhanced learning of models. In statistics and machine learning, feature selection refers to choosing a subset of relevant features from the dataset to use in further model construction.



To know more about the feature selection techniques, one can refer to this article.

Recursive feature elimination, RFE for short, is a wrapper-type feature selection technique, which means that a different machine learning algorithm is used at the core of the method to help select the features.

This article will discuss the Recursive Feature Elimination technique, which is popular because it is easy to configure and use. As its name suggests, it recursively removes features, builds a model on the remaining features, and calculates the accuracy for those features. This process continues until we get the desired number of features.
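To make the recursion concrete, here is a minimal hand-rolled sketch of that loop (illustrative only: the target of 10 features and the random-forest importance ranking are assumptions for the sketch; sklearn's RFE class, used later, implements this loop for you):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
X, y = data.data, data.target

remaining = list(range(X.shape[1]))   # start with all 30 feature indices
n_features_to_keep = 10               # illustrative target

while len(remaining) > n_features_to_keep:
    # refit the model on the surviving features
    model = RandomForestClassifier(n_estimators=50, random_state=0, n_jobs=-1)
    model.fit(X[:, remaining], y)
    # drop the single least important feature, then repeat
    weakest = int(np.argmin(model.feature_importances_))
    remaining.pop(weakest)

print(len(remaining))  # 10 features survive
```

Each pass refits the model, so the importances are re-estimated on the reduced feature set rather than reused from the first fit; that re-estimation is what makes the elimination "recursive".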

Code implementation of Recursive Feature Elimination

Here we compare the feature importance given by a standard tree-based algorithm with the feature importance given by RFE.

This code implementation is divided into two parts: first, we make the feature selection with the sklearn SelectFromModel class; second, we do the feature selection using RFE with cross-validation to prevent overfitting. SelectFromModel is a meta-transformer that selects features based on the feature importance given by a base estimator; here, the estimator is a tree-based algorithm.

Import dependencies: 

import pandas as pd
import numpy as np
# all the heavy lifting is done by sklearn
from sklearn.feature_selection import SelectFromModel, RFECV
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import accuracy_score, r2_score

Classification problem:

Feature selection using SelectFromModel:

First, we will carry out feature selection for a classification problem, for which we are using the breast cancer data. Let's load the dataset and state the input and output features.

data = load_breast_cancer()
x = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

We have a total of 30 features. So let’s calculate feature importance from SelectFromModel.

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
sfm = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1))
sfm.fit(x_train, y_train)
sfm.get_support()
array([ True, False,  True,  True, False, False,  True,  True, False,
       False, False, False, False,  True, False, False, False, False,
       False, False,  True, False,  True,  True, False, False, False,
        True, False, False])

From the get_support method, we can check how many of the 30 variables are selected by the SelectFromModel class; True means the corresponding variable is chosen.

The main question is: how does SelectFromModel carry out this feature selection? First, it obtains the importance of all the features from the fitted estimator, and then it takes the mean of those importances as a threshold. Features whose importance is greater than the mean are taken as the final features.

To understand this, you can leverage the following code.

print('mean of feature importance:', np.mean(sfm.estimator_.feature_importances_))
mean of feature importance: 0.033333333333333326

sfm.estimator_.feature_importances_
array([0.05336183, 0.01728828, 0.05173067, 0.04075005, 0.00657474,
       0.00742122, 0.08843568, 0.10818283, 0.00377841, 0.00364655,
       0.01893041, 0.00400353, 0.00598462, 0.03979087, 0.00350484,
       0.00527455, 0.00543643, 0.0036614 , 0.00552986, 0.00401107,
       0.09243118, 0.01815056, 0.11413166, 0.10100969, 0.00941286,
       0.01425051, 0.02446495, 0.12945962, 0.01079693, 0.00859418])

A total of ten features are selected by SelectFromModel.

feature_selected = x_train.columns[sfm.get_support()]
feature_selected
Index(['mean radius', 'mean perimeter', 'mean area', 'mean concavity',
       'mean concave points', 'area error', 'worst radius', 'worst perimeter',
       'worst area', 'worst concave points'],
      dtype='object')

Now let's check the accuracy for all the features and for the selected features. For that, we use a user-defined function that takes the training and testing data and returns the accuracy.

def model_accuracy(x_train, x_test, y_train, y_test):
    model = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    return accuracy_score(y_test, y_pred) * 100

Assign new features-

x_train_fs = sfm.transform(x_train)
x_test_fs = sfm.transform(x_test)

The accuracy on all the features and selected features, respectively.
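The comparison can be run end to end as in the following self-contained sketch (exact percentages depend on random_state, the train/test split, and the sklearn version, so treat the printed numbers as indicative rather than exact):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
x = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

def model_accuracy(x_train, x_test, y_train, y_test):
    model = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
    model.fit(x_train, y_train)
    return accuracy_score(y_test, model.predict(x_test)) * 100

# select features whose importance exceeds the mean importance
sfm = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1))
sfm.fit(x_train, y_train)
x_train_fs, x_test_fs = sfm.transform(x_train), sfm.transform(x_test)

acc_all = model_accuracy(x_train, x_test, y_train, y_test)
acc_selected = model_accuracy(x_train_fs, x_test_fs, y_train, y_test)
print('all features:', acc_all)
print('selected features:', acc_selected)
```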


You can see how feature selection can improve the results. One can also check the detailed classification report, where the improvements are visible as well. Now let's check the result of RFE with cross-validation.

Recursive Feature Elimination (RFE) with cross-validation: 

RFE works like SelectFromModel, but it chooses the features recursively, as said earlier. Sklearn provides a separate RFE class where you have to specify how many features you want for the model, but this approach is not advisable on its own because we don't know in advance which feature count gives the best result. Furthermore, if you use only RFE, you have to write additional iteration code to evaluate performance for every count from a single feature up to all the features.
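For reference, a plain RFE call with a fixed feature count looks like the following; the choice of 10 features here is an arbitrary assumption, which is exactly the guesswork described above:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

data = load_breast_cancer()

# keep exactly 10 features -- an arbitrary choice we cannot know is optimal
rfe = RFE(RandomForestClassifier(n_estimators=50, random_state=0, n_jobs=-1),
          n_features_to_select=10)
rfe.fit(data.data, data.target)

print(rfe.support_.sum())              # 10 features kept
print(data.feature_names[rfe.support_])
```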

Here, cross-validation comes to the rescue. Sklearn provides RFE combined with cross-validation as RFECV, where you specify the learning estimator and the cross-validation technique; here we have used StratifiedKFold.

Once configured, we fit the method on the training set, and with the get_support method we can check the selected features, the same as with SelectFromModel.

rfecv = RFECV(RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1),
              cv=StratifiedKFold(10))
rfecv.fit(x_train, y_train)
# check the support
rfecv.get_support()
array([ True,  True,  True,  True, False, False,  True,  True, False,
       False, False, False, False,  True, False, False, False, False,
       False, False,  True,  True,  True,  True,  True,  True,  True,
        True, False, False])
# features by name and count
feature_selected = x_train.columns[rfecv.get_support()]
feature_selected
Index(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean concavity', 'mean concave points', 'area error', 'worst radius',
       'worst texture', 'worst perimeter', 'worst area', 'worst smoothness',
       'worst compactness', 'worst concavity', 'worst concave points'],
      dtype='object')

It has selected 15 features out of 30, giving accuracy greater than the previous method. Let's check:

x_train_rfe = rfecv.transform(x_train)
x_test_rfe = rfecv.transform(x_test)
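Putting the pieces together, the accuracy on the RFECV-selected features can be computed as in this self-contained sketch (it uses a lighter forest and 5-fold CV than the article's configuration so it runs quickly; the exact percentage varies with random_state and the sklearn version):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, train_test_split

data = load_breast_cancer()
x = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

# RFECV picks the feature count by cross-validated score, no guesswork needed
rfecv = RFECV(RandomForestClassifier(n_estimators=30, random_state=0, n_jobs=-1),
              cv=StratifiedKFold(5))
rfecv.fit(x_train, y_train)
x_train_rfe, x_test_rfe = rfecv.transform(x_train), rfecv.transform(x_test)

# score a fresh classifier on the reduced feature set
model = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
model.fit(x_train_rfe, y_train)
acc_rfe = accuracy_score(y_test, model.predict(x_test_rfe)) * 100
print('accuracy on RFE-selected features:', acc_rfe)
```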

Now you can see the clear difference between the two feature selection methods; RFE gives the higher accuracy.

Regression Problem:

Feature selection using RFE with cross-validation:

We generate a random dataset with make_regression.

from sklearn.datasets import make_regression
x, y = make_regression(n_samples=1000, n_features=15, n_targets=1)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
# fit the method
rfecv = RFECV(GradientBoostingRegressor())
rfecv.fit(x_train, y_train)
rfecv.get_support()
array([ True, False,  True, False,  True,  True, False,  True,  True,
        True,  True, False,  True,  True, False])
# function to calculate the R2 score
def model_accuracy(x_train, x_test, y_train, y_test):
    model = GradientBoostingRegressor()
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    return r2_score(y_test, y_pred) * 100

The accuracy for all features and for the selected features, respectively:

x_train_rfe = rfecv.transform(x_train)
x_test_rfe = rfecv.transform(x_test)
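The regression comparison can be run as in this self-contained sketch (random_state is fixed here so the run is repeatable, and a smaller gradient-boosting model is used inside RFECV for speed; both are assumptions on top of the article's code):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import RFECV
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

x, y = make_regression(n_samples=1000, n_features=15, n_targets=1, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

def model_accuracy(x_train, x_test, y_train, y_test):
    model = GradientBoostingRegressor(random_state=0)
    model.fit(x_train, y_train)
    return r2_score(y_test, model.predict(x_test)) * 100

# a lighter booster inside the recursive elimination keeps the run fast
rfecv = RFECV(GradientBoostingRegressor(n_estimators=50, random_state=0))
rfecv.fit(x_train, y_train)
x_train_rfe, x_test_rfe = rfecv.transform(x_train), rfecv.transform(x_test)

r2_all = model_accuracy(x_train, x_test, y_train, y_test)
r2_selected = model_accuracy(x_train_rfe, x_test_rfe, y_train, y_test)
print('all features:', r2_all)
print('selected features:', r2_selected)
```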

For the regression analysis, RFE with the selected features performs only slightly better than using all the features.

Note: Results may vary due to the stochastic nature of algorithms.


For the classification problem, there are 30 features in total; SelectFromModel reduces them to 10 and gives an accuracy of 94%, whereas RFE reduces them to 15 and gives an accuracy of 97%. For the regression problem, the accuracy does not change as much as we expected.

From this article, we have learned that recursive feature elimination is an efficient approach to reducing features, and we have seen the results practically using Python code.


Vijaysinh Lendave
Vijaysinh is an enthusiast of machine learning and deep learning. He is skilled in ML algorithms, data manipulation, handling and visualization, and model building.
