
Guide To Dimensionality Reduction With Recursive Feature Elimination

Nowadays, the data being generated is rich in information collected from various sources such as IoT devices, sensors, and social media. This often makes the data high dimensional for a particular problem statement. And we know that not every feature is equally important; some are simply irrelevant to the problem. High-dimensional data analysis is therefore a huge challenge for the data scientist, and here comes the role of feature selection or feature engineering.

Because many machine learning algorithms are highly prone to the curse of dimensionality, feature selection gives an effective way to overcome challenges such as overfitting, poor learning accuracy, and long computation time, and it facilitates enhanced learning of models. In statistics and machine learning, feature selection refers to choosing a subset of relevant features from the dataset to use in further model construction.


To know more about feature selection techniques, one can refer to this article.

Recursive feature elimination, in short RFE, is a wrapper-type feature selection technique, which means that a separate machine learning algorithm is used at the core of the method to help select the features.

This article will discuss the Recursive Feature Elimination technique, which is popular because it is easy to configure and use. As its name suggests, it recursively removes features, builds a model on the remaining features, and calculates the accuracy for that subset. This process continues until the desired number of features is reached.
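To make the idea concrete, here is a minimal sketch of sklearn's plain RFE class on the breast cancer dataset used later in this article; the estimator, n_features_to_select, and step values here are illustrative choices, not the exact configuration used below.

 from sklearn.datasets import load_breast_cancer
 from sklearn.ensemble import RandomForestClassifier
 from sklearn.feature_selection import RFE

 X, y = load_breast_cancer(return_X_y=True)

 # wrap an estimator and repeatedly drop the weakest feature (step=1)
 # until only n_features_to_select remain
 rfe = RFE(estimator=RandomForestClassifier(n_estimators=100, random_state=0),
           n_features_to_select=10, step=1)
 rfe.fit(X, y)

 print(rfe.support_)   # boolean mask of the selected features
 print(rfe.ranking_)   # 1 = selected; higher ranks were eliminated earlier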


Code implementation of Recursive Feature Elimination

Here we compare the feature selection based on the importances given by a standard tree-based algorithm with the feature selection performed by RFE.

This code implementation is divided into two parts: first, we perform feature selection with sklearn's SelectFromModel class; second, we perform feature selection using RFE with cross-validation to prevent overfitting. SelectFromModel is a meta-transformer that selects features based on the feature importances given by a base estimator; here, the estimators are tree-based algorithms.

Import dependencies: 

import pandas as pd
import numpy as np
# all the heavy lifting is done by sklearn
from sklearn.feature_selection import SelectFromModel, RFECV
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import accuracy_score, r2_score

Classification problem:

Feature selection using SelectFromModel:

First, we will carry out feature selection for a classification problem, for which we use the breast cancer dataset. Then, let's load the dataset and set up the input and output features.

 data = load_breast_cancer()
 x = pd.DataFrame(data.data,columns=data.feature_names)
 y = data.target
 x.head() 

We have a total of 30 features. So let's select features based on the importances computed by SelectFromModel.

x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.2, random_state = 0)
 sfm = SelectFromModel(RandomForestClassifier(n_estimators=100,random_state=True,n_jobs=-1))
sfm.fit(x_train,y_train)
sfm.get_support()
 array([ True, False,  True,  True, False, False,  True,  True, False,
        False, False, False, False,  True, False, False, False, False,
        False, False,  True, False,  True,  True, False, False, False,
         True, False, False]) 

From the output of get_support(), we can check which of the 30 variables were selected by SelectFromModel: True means the corresponding variable is chosen.

The main question is: how does SelectFromModel carry out this feature selection? First, it obtains the feature importances of all the variables from the fitted estimator, then it takes the mean of those importances as the threshold. Features whose importance is greater than or equal to this mean are kept as the final features.

To understand this, you can leverage the following code.

print('mean of feature importance:',np.mean(sfm.estimator_.feature_importances_))
Output:
mean of feature importance: 0.033333333333333326
sfm.estimator_.feature_importances_
Output:
array([0.05336183, 0.01728828, 0.05173067, 0.04075005, 0.00657474,
        0.00742122, 0.08843568, 0.10818283, 0.00377841, 0.00364655,
        0.01893041, 0.00400353, 0.00598462, 0.03979087, 0.00350484,
        0.00527455, 0.00543643, 0.0036614 , 0.00552986, 0.00401107,
        0.09243118, 0.01815056, 0.11413166, 0.10100969, 0.00941286,
        0.01425051, 0.02446495, 0.12945962, 0.01079693, 0.00859418])   

A total of ten features are selected according to SelectFromModel.
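As a quick sanity check on the "greater than or equal to the mean" rule, the mask computed by hand from the importances should match what get_support() returns; this small sketch reuses the fitted sfm object from above.

 importances = sfm.estimator_.feature_importances_
 # SelectFromModel's default threshold for tree-based estimators is the mean importance
 manual_mask = importances >= np.mean(importances)
 print(np.array_equal(manual_mask, sfm.get_support()))  # expected: True
 print('features kept:', manual_mask.sum())              # expected: 10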

 feature_selected = x_train.columns[sfm.get_support()]
 feature_selected 
 Index(['mean radius', 'mean perimeter', 'mean area', 'mean concavity',
        'mean concave points', 'area error', 'worst radius', 'worst perimeter',
        'worst area', 'worst concave points'],
       dtype='object') 

Now let's check the accuracy on all features versus the selected features. For that, we use a user-defined function that takes the training and testing data and returns the accuracy.

 def model_accuracy(x_train,x_test,y_train,y_test):
     model = RandomForestClassifier(n_estimators=100,random_state=0,n_jobs=-1)
     model.fit(x_train,y_train)
     #training_score = model.score(x_train,y_train)
     y_pred = model.predict(x_test)
     return accuracy_score(y_test,y_pred)*100 

Transform the training and test sets to keep only the selected features:

 x_train_fs = sfm.transform(x_train)
 x_test_fs = sfm.transform(x_test) 

The accuracy on all the features and on the selected features, respectively:

 model_accuracy(x_train,x_test,y_train,y_test)
 94.73684210526315
 model_accuracy(x_train_fs,x_test_fs,y_train,y_test)
 95.6140350877193 

You can see how feature selection can improve the results. One can also look at the detailed classification report, where the improvements are visible as well. Now let's check the result with RFE with cross-validation.
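For a closer look than plain accuracy, a per-class classification report can be printed; the following sketch assumes the x_train_fs and x_test_fs matrices created above and refits the same random forest.

 from sklearn.metrics import classification_report

 clf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
 clf.fit(x_train_fs, y_train)
 # precision, recall and f1-score for each class on the selected features
 print(classification_report(y_test, clf.predict(x_test_fs)))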

Recursive Feature Elimination (RFE) with cross-validation: 

RFE serves the same purpose as SelectFromModel, but it chooses the features recursively, as said earlier. Sklearn provides a standalone RFE class where you have to specify how many features you want in the model, but this approach is not advisable on its own because we don't know in advance which feature count gives the best result. Furthermore, if you decide to use only RFE, you have to write additional looping code to evaluate performance from a single feature up to all the features.
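For illustration, such a manual search could look roughly like the sketch below, which fits plain RFE for every candidate subset size and scores it with the model_accuracy helper defined earlier; RFECV automates this search with cross-validation instead of a single test split.

 from sklearn.feature_selection import RFE

 scores = {}
 for n in range(1, x_train.shape[1] + 1):
     rfe = RFE(RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1),
               n_features_to_select=n)
     rfe.fit(x_train, y_train)
     # score a random forest trained on this n-feature subset
     scores[n] = model_accuracy(rfe.transform(x_train), rfe.transform(x_test),
                                y_train, y_test)

 best_n = max(scores, key=scores.get)
 print('best number of features:', best_n, '| accuracy:', scores[best_n])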

Here cross-validation comes to the rescue. Sklearn provides RFE combined with cross-validation as RFECV, where you have to specify the learning estimator and the cross-validation strategy; here we use StratifiedKFold.

Once configured, we fit the method on the training set, and with get_support() we can check the selected features, just as with SelectFromModel.

rfecv = RFECV(RandomForestClassifier(n_estimators=100,random_state=True,n_jobs=-1),cv=StratifiedKFold(10))
rfecv.fit(x_train,y_train)
# check the support
rfecv.get_support()
 array([ True,  True,  True,  True, False, False,  True,  True, False,
        False, False, False, False,  True, False, False, False, False,
        False, False,  True,  True,  True,  True,  True,  True,  True,
         True, False, False]) 
# name wise features and count
feature_selected = x_train.columns[rfecv.get_support()]
feature_selected
 Index(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
        'mean concavity', 'mean concave points', 'area error', 'worst radius',
        'worst texture', 'worst perimeter', 'worst area', 'worst smoothness',
        'worst compactness', 'worst concavity', 'worst concave points'],
       dtype='object') 

It has selected 15 features out of 30 and gives a higher accuracy than the previous method. Let's check:

 x_train_rfe = rfecv.transform(x_train)
 x_test_rfe = rfecv.transform(x_test)
 model_accuracy(x_train_rfe,x_test_rfe,y_train,y_test)
 97.36842105263158 

Now you can see a clear difference between the two feature selection methods: RFE gives the highest accuracy.
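To see why RFECV settled on 15 features, the cross-validated score for each subset size can be inspected. Note that the attribute name depends on the sklearn version: recent releases (1.0+) expose cv_results_, while older ones used grid_scores_.

 print('optimal number of features:', rfecv.n_features_)

 # with the default min_features_to_select=1 and step=1,
 # entry i corresponds to a subset of i+1 features
 mean_scores = rfecv.cv_results_['mean_test_score']
 for n, score in enumerate(mean_scores, start=1):
     print(f'{n:2d} features -> mean CV accuracy {score:.4f}')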

Regression Problem:

Feature selection using RFE with cross-validation:

We generate a random dataset with make_regression.

from sklearn.datasets import make_regression
x,y = make_regression(n_samples=1000,n_features=15,n_targets=1)
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.2,random_state = True)
# fit the method
rfecv = RFECV(GradientBoostingRegressor())
rfecv.fit(x_train,y_train)
rfecv.get_support()
 array([ True, False,  True, False,  True,  True, False,  True,  True,
         True,  True, False,  True,  True, False])
# function to calculate accuracy
def model_accuracy(x_train,x_test,y_train,y_test):
    model = GradientBoostingRegressor()
    model.fit(x_train,y_train)
    y_pred = model.predict(x_test)
    return r2_score(y_test,y_pred)*100 

Accuracy for all features and selected features respectively.

 model_accuracy(x_train,x_test,y_train,y_test)
 90.48235983692466
 x_train_rfe = rfecv.transform(x_train)
 x_test_rfe = rfecv.transform(x_test)
 model_accuracy(x_train_rfe,x_test_rfe,y_train,y_test)
 90.71163963962526 

For the regression problem, RFE performs only slightly better than using all the features.

Note: Results may vary due to the stochastic nature of algorithms.
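If reproducible numbers are needed, a fixed random_state can be passed wherever the data generator, splitter, and estimators accept one; for example (an illustrative seed, not the one used above):

 x, y = make_regression(n_samples=1000, n_features=15, n_targets=1, random_state=42)
 x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
 rfecv = RFECV(GradientBoostingRegressor(random_state=42))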

EndNotes:

For the classification problem, there are 30 features in total; using SelectFromModel, the features are reduced to 10 with an accuracy of roughly 96%, whereas using RFE with cross-validation, the features are reduced to 15 with an accuracy of roughly 97%. For the regression problem, the change in accuracy is not as large as we expected.

From this article, we have learned that recursive feature elimination is an efficient approach to reducing features, and we have seen the results practically using Python code.

