Hands-On Guide To Automated Feature Selection Using Boruta


Feature selection is one of the most crucial and time-consuming phases of the machine learning process, second only to data cleaning. What if we could automate it? Well, that’s exactly what Boruta does. Boruta is an algorithm designed to take the “all-relevant” approach to feature selection, i.e., it tries to find all features in the dataset that carry information relevant to a given task. Its counterpart is the “minimal-optimal” approach, which seeks the smallest subset of features needed for a well-performing model.

Boruta was originally an R package; it has been recoded in Python (as BorutaPy) with some additions and improvements:

  • Faster run times
  • Scikit-learn-like interface: it uses fit(X, y), transform(X), and fit_transform(X, y) to run the feature selection (see the sketch after this list).
  • Compatible with any ensemble method from scikit-learn
  • Automatic n_estimator selection
  • Ranking of features
  • Feature importance is derived from Gini impurity instead of the RandomForest R package’s MDA (mean decrease accuracy).
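
As a quick illustration of that interface, here is a minimal sketch; the toy dataset is assumed purely for demonstration:

 from sklearn.datasets import make_classification
 from sklearn.ensemble import RandomForestClassifier
 from boruta import BorutaPy

 # toy data, assumed for illustration only
 X, y = make_classification(n_samples=300, n_features=8, n_informative=3, random_state=0)

 rf = RandomForestClassifier(n_jobs=-1, max_depth=5)
 feat_selector = BorutaPy(rf, n_estimators='auto', random_state=1)

 feat_selector.fit(X, y)                  # run the feature selection
 X_filtered = feat_selector.transform(X)  # keep only the confirmed features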

Algorithm

Here’s the algorithm behind Boruta, as mentioned in the paper:

  1. Extend the information system by adding copies of all variables (the information system is always extended by at least 5 shadow attributes, even if the number of attributes in the original set is lower than 5).
  2. Shuffle the added attributes to remove their correlations with the response.
  3. Run a random forest classifier on the extended information system and gather the Z scores computed.
  4. Find the maximum Z score among shadow attributes (MZSA), and then assign a hit to every attribute that scored better than MZSA.
  5. For each attribute with undetermined importance perform a two-sided test of equality with the MZSA.
  6. Deem the attributes significantly lower than MZSA as ‘unimportant’ and permanently remove them from the information system.
  7. Deem the attributes which have importance significantly higher than MZSA as ‘important’.
  8. Remove all shadow attributes.
  9. Repeat the procedure until the importance is assigned for all the attributes, or the algorithm has reached the previously set limit of the random forest runs.

Basically, a number of randomly shuffled shadow attributes are created to establish a baseline. A hypothesis test is then used to determine whether a variable is merely randomly correlated with the target or carries significant information. By default, this test is carried out at a significance level of 0.05; this can be changed using the alpha argument when creating a BorutaPy object.

Variables that fail to reject the null hypothesis are discarded. As Boruta iteratively removes uninformative variables, the measured importance of the remaining relevant variables improves, with the comparatively noisier ones seeing the largest improvements. This happens because the noisy, random variables they were correlated with have been discarded from the dataset.
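To make the mechanics concrete, here is a minimal sketch of a single iteration (steps 1–4 of the algorithm), using scikit-learn’s Gini-based importances; the dataset and all names here are hypothetical, and the real algorithm repeats this loop and applies the two-sided test at level alpha on the accumulated hits:

 import numpy as np
 import pandas as pd
 from sklearn.datasets import make_classification
 from sklearn.ensemble import RandomForestClassifier

 # hypothetical toy data: 5 informative features out of 10
 X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=0)
 X = pd.DataFrame(X, columns=[f"f{i}" for i in range(10)])

 # steps 1 & 2: add shuffled "shadow" copies of every feature
 shadows = X.apply(np.random.permutation)
 shadows.columns = ["shadow_" + c for c in X.columns]
 X_extended = pd.concat([X, shadows], axis=1)

 # step 3: fit a random forest on the extended data and gather importances
 rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_extended, y)
 importances = pd.Series(rf.feature_importances_, index=X_extended.columns)

 # step 4: a real feature scores a "hit" if it beats the best shadow feature
 mzsa = importances[shadows.columns].max()
 print(importances[X.columns] > mzsa)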

Here’s a plot from the original paper illustrating the evolution of the relevance (Z) score; pay close attention to the one glimmering green line amidst the red mess in the first round of the Boruta run.

Z score evolution during a Boruta run. Green lines correspond to confirmed attributes, red to rejected ones, and blue to the minimal, average, and maximal shadow attribute importance, respectively.

Requirements

  • numpy
  • scipy
  • scikit-learn

Installation

pip:

pip install Boruta

Conda:

conda install -c conda-forge boruta_py


Using Boruta for feature selection

  1. Importing Boruta and other required libraries. 
 import pandas as pd
 import numpy as np
 from sklearn.ensemble import RandomForestClassifier
 from boruta import BorutaPy
 from sklearn.model_selection import train_test_split
 from sklearn.metrics import accuracy_score 
  2. Loading the dataset, separating the features from the target variable, and splitting the data into a train and a test set.
 URL = "https://raw.githubusercontent.com/Aditya1001001/English-Premier-League/master/pos_modelling_data.csv"
 data = pd.read_csv(URL)
 data.info()
 X = data.drop('Position', axis = 1)
 y = data['Position']
 X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .2, random_state = 1) 
  3. Creating a baseline RandomForestClassifier model with all the features.
 rf_all_features = RandomForestClassifier(random_state=1, n_estimators=1000, max_depth=5)
 rf_all_features.fit(X_train, y_train) 

accuracy_score(y_test, rf_all_features.predict(X_test))

  4. Creating a BorutaPy object with RandomForestClassifier as the estimator and ranking the features. 

One important thing to note here is that Boruta works on NumPy arrays only.

 rfc = RandomForestClassifier(random_state=1, n_estimators=1000, max_depth=5)
 boruta_selector = BorutaPy(rfc, n_estimators='auto', verbose=2, random_state=1)
 boruta_selector.fit(np.array(X_train), np.array(y_train))  
 print("Ranking: ",boruta_selector.ranking_)          
 print("No. of significant features: ", boruta_selector.n_features_) 

Boruta has selected 31 features; the features with rank 1 are the selected ones. Let’s create a table to see exactly which features were rejected.

 selected_rf_features = pd.DataFrame({'Feature':list(X_train.columns),
                                       'Ranking':boruta_selector.ranking_})
 selected_rf_features.sort_values(by='Ranking') 
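
BorutaPy also exposes a boolean mask over the columns, support_; continuing from the code above, it offers a quick way to list the confirmed features by name:

 selected_features = X_train.columns[boruta_selector.support_]
 print(list(selected_features))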
  5. Using the BorutaPy object to transform the features in the dataset.
 X_important_train = boruta_selector.transform(np.array(X_train))
 X_important_test = boruta_selector.transform(np.array(X_test)) 
  6. Creating another RandomForestClassifier model with the same parameters as the baseline classifier and training it with the selected features.
 rf_boruta = RandomForestClassifier(random_state=1, n_estimators=1000, max_depth=5)
 rf_boruta.fit(X_important_train, y_train) 

accuracy_score(y_test, rf_boruta.predict(X_important_test))
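
To wrap up, printing the two scores side by side (continuing from the variables above) makes it easy to check whether the smaller, Boruta-selected feature set holds up against the full one:

 print("Accuracy with all features:   ", accuracy_score(y_test, rf_all_features.predict(X_test)))
 print("Accuracy with Boruta features:", accuracy_score(y_test, rf_boruta.predict(X_important_test)))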
