Nowadays, the data being generated is rich in information collected from various sources such as IoT devices, sensors, and social media. As a result, datasets for a given problem are often high dimensional. Not every feature matters equally; some are simply irrelevant to the problem. High-dimensional data analysis is therefore a major challenge for data scientists, and this is where feature selection, or feature engineering, comes in.
Since many machine learning algorithms are highly prone to the curse of dimensionality, feature selection offers an effective way to overcome challenges such as overfitting, poor learning accuracy, and long computation times, and it facilitates better model learning. In statistics and machine learning, feature selection refers to choosing a subset of relevant features from the dataset to use in further model construction.
Recursive feature elimination, RFE for short, is a wrapper-type feature selection technique, which means that a separate machine learning algorithm is used at the core of the method to help select the features.
This article discusses the recursive feature elimination technique, which is popular because it is easy to configure and use. As its name suggests, it recursively removes features, builds a model on the remaining ones, and evaluates their importance. This process continues until the desired number of features is reached.
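The loop described above can be sketched in a few lines. This is an illustrative simplification, not sklearn's exact implementation; the choice of RandomForestClassifier and the target of 10 features are arbitrary assumptions for the demo.

```python
# Minimal sketch of recursive feature elimination:
# repeatedly fit an estimator, drop the least important feature,
# and stop once the desired feature count is reached.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

x, y = load_breast_cancer(return_X_y=True)
remaining = list(range(x.shape[1]))   # indices of surviving features
n_features_to_keep = 10               # arbitrary target for the demo

while len(remaining) > n_features_to_keep:
    model = RandomForestClassifier(n_estimators=50, random_state=0, n_jobs=-1)
    model.fit(x[:, remaining], y)
    # drop the feature with the lowest importance in this round
    weakest = int(np.argmin(model.feature_importances_))
    remaining.pop(weakest)

print(len(remaining))  # 10
```

Sklearn's actual RFE can drop several features per round (the `step` parameter), but the core idea is the same.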
Code implementation of Recursive Feature Elimination
Here we compare the feature importance given by a standard tree-based algorithm with the feature importance given by RFE.
This code implementation is divided into two parts: first, we perform feature selection with sklearn's SelectFromModel class; second, we perform feature selection using RFE with cross-validation to prevent overfitting. SelectFromModel is a meta-transformer that selects features based on the feature importances given by a base estimator; here, the estimators are tree-based algorithms.
import pandas as pd
import numpy as np
# all the heavy lifting is done by sklearn
from sklearn.feature_selection import SelectFromModel, RFECV
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import accuracy_score, r2_score
Feature selection using SelectFromModel:
First, we carry out feature selection for a classification problem, using the breast cancer dataset. Let's load the dataset and state the input and output features.
data = load_breast_cancer()
x = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
x.head()
We have a total of 30 features. Let's calculate the feature importances with SelectFromModel.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
sfm = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1))
sfm.fit(x_train, y_train)
sfm.get_support()

array([ True, False,  True,  True, False, False,  True,  True, False,
       False, False, False, False,  True, False, False, False, False,
       False, False,  True, False,  True,  True, False, False, False,
        True, False, False])
From the get_support method, we can check how many of the 30 variables are selected by the SelectFromModel class; True means the corresponding variable is chosen.
The main question is: how does SelectFromModel carry out this feature selection? First, it obtains the importances of all the features from the fitted algorithm, then it takes the mean of those importances as a threshold. Every feature whose importance is greater than the mean is kept in the final set.
To understand this, you can leverage the following code.
print('mean of feature importance:', np.mean(sfm.estimator_.feature_importances_))

Output:
mean of feature importance: 0.033333333333333326

sfm.estimator_.feature_importances_

Output:
array([0.05336183, 0.01728828, 0.05173067, 0.04075005, 0.00657474,
       0.00742122, 0.08843568, 0.10818283, 0.00377841, 0.00364655,
       0.01893041, 0.00400353, 0.00598462, 0.03979087, 0.00350484,
       0.00527455, 0.00543643, 0.0036614 , 0.00552986, 0.00401107,
       0.09243118, 0.01815056, 0.11413166, 0.10100969, 0.00941286,
       0.01425051, 0.02446495, 0.12945962, 0.01079693, 0.00859418])
In total, ten features are selected by SelectFromModel.
feature_selected = x_train.columns[sfm.get_support()]
feature_selected
Index(['mean radius', 'mean perimeter', 'mean area', 'mean concavity',
       'mean concave points', 'area error', 'worst radius', 'worst perimeter',
       'worst area', 'worst concave points'],
      dtype='object')
Now let's check the accuracy on all the features versus the selected features. For that, we define a function that takes the training and testing data and returns the accuracy.
def model_accuracy(x_train, x_test, y_train, y_test):
    model = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    return accuracy_score(y_test, y_pred) * 100
Transform the training and test sets so they keep only the selected features:
x_train_fs = sfm.transform(x_train)
x_test_fs = sfm.transform(x_test)
The accuracy on all the features and selected features, respectively.
model_accuracy(x_train, x_test, y_train, y_test)
94.73684210526315

model_accuracy(x_train_fs, x_test_fs, y_train, y_test)
95.6140350877193
You can see how feature selection can improve the results. One can also inspect the detailed classification report, where the improvement is visible as well. Next, let's check the result with RFE with cross-validation.
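For reference, a classification report for the selected features can be produced as below. This is a sketch that mirrors the article's setup (breast cancer data, RandomForestClassifier inside SelectFromModel, `random_state=0`); the exact numbers will depend on the split and seed.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# rebuild the same split and selector as above
data = load_breast_cancer()
x = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

sfm = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1))
sfm.fit(x_train, y_train)

# refit on the selected features only and report per-class metrics
model = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
model.fit(sfm.transform(x_train), y_train)
y_pred = model.predict(sfm.transform(x_test))
print(classification_report(y_test, y_pred))
```

The report shows precision, recall, and F1-score per class, which gives a fuller picture than accuracy alone.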
Recursive Feature Elimination (RFE) with cross-validation:
RFE works like SelectFromModel, but it chooses the features recursively, as described earlier. Sklearn provides a separate RFE class where you must specify how many features you want for the model, but this approach is not advisable on its own, because we don't know in advance which feature count gives the best result. Furthermore, if you use only RFE, you have to write additional looping code to evaluate performance from a single feature up to all the features.
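For comparison, plain RFE looks as follows. The choice of 10 features here is arbitrary, which is exactly the drawback noted above, and the RandomForestClassifier settings are just illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

x, y = load_breast_cancer(return_X_y=True)

# plain RFE: the number of features to keep must be fixed up front
rfe = RFE(RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1),
          n_features_to_select=10)
rfe.fit(x, y)
print(rfe.support_.sum())  # 10
```

RFECV removes the need to guess `n_features_to_select` by scoring every feature count with cross-validation.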
Here cross-validation comes to the rescue. Sklearn provides RFE combined with cross-validation as RFECV, where you specify the learning estimator and the cross-validation strategy; here we use StratifiedKFold.
Once configured, we fit the method on the training set, and with the get_support method we can check the selected features, just as with SelectFromModel.
rfecv = RFECV(RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1),
              cv=StratifiedKFold(10))
rfecv.fit(x_train, y_train)
# check the support
rfecv.get_support()

array([ True,  True,  True,  True, False, False,  True,  True, False,
       False, False, False, False,  True, False, False, False, False,
       False, False,  True,  True,  True,  True,  True,  True,  True,
        True, False, False])
# name-wise features and count
feature_selected = x_train.columns[rfecv.get_support()]
feature_selected

Index(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean concavity', 'mean concave points', 'area error', 'worst radius',
       'worst texture', 'worst perimeter', 'worst area', 'worst smoothness',
       'worst compactness', 'worst concavity', 'worst concave points'],
      dtype='object')
It has selected 15 features out of 30. Let's check whether the accuracy improves over the previous method:
x_train_rfe = rfecv.transform(x_train)
x_test_rfe = rfecv.transform(x_test)
model_accuracy(x_train_rfe, x_test_rfe, y_train, y_test)
97.36842105263158
Now you can see the clear difference between the two feature selection methods: RFE with cross-validation gives the highest accuracy.
Feature selection using RFE with cross-validation (regression):
We generate a random dataset using make_regression.
from sklearn.datasets import make_regression

x, y = make_regression(n_samples=1000, n_features=15, n_targets=1)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

# fit the method
rfecv = RFECV(GradientBoostingRegressor())
rfecv.fit(x_train, y_train)
rfecv.get_support()

array([ True, False,  True, False,  True,  True, False,  True,  True,
        True,  True, False,  True,  True, False])

# function to calculate the score (R² as a percentage)
def model_accuracy(x_train, x_test, y_train, y_test):
    model = GradientBoostingRegressor()
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    return r2_score(y_test, y_pred) * 100
Accuracy for all features and selected features respectively.
model_accuracy(x_train, x_test, y_train, y_test)
90.48235983692466

x_train_rfe = rfecv.transform(x_train)
x_test_rfe = rfecv.transform(x_test)
model_accuracy(x_train_rfe, x_test_rfe, y_train, y_test)
90.71163963962526
For this regression problem, RFE performs only slightly better than using all the features.
Note: Results may vary due to the stochastic nature of algorithms.
For the classification problem, there are 30 features in total; SelectFromModel reduces them to 10 and gives an accuracy of about 95.6%, whereas RFE with cross-validation reduces them to 15 and gives an accuracy of about 97.4%. For the regression problem, the accuracy does not change as much as we expected.
From this article, we have learned that recursive feature elimination is an efficient approach to reducing features, and we have seen the results practically using Python code.