
A Complete Guide to Sequential Feature Selection


Sequential Feature Selection

In machine learning, feature selection is the procedure of selecting the important features from the data so that the model's output is accurate and meets the requirement. In real-life development, the data handed to a modeller usually contains many features, and quite often several of them are not needed to build the model at all; their presence can even reduce the model's performance. Feature selection is therefore an important data preprocessing step.

Basically, feature selection is a method of reducing the number of features in the dataset so that the model performs better and the computational effort is reduced. In feature selection, we try to find the input variables that have a strong relationship with the target variable, meaning that a change in such an input variable should produce a change in the output variable.

There can be various reasons to perform feature selection.

  • Simplification of the model.
  • Less computational time.
  • To avoid the curse of dimensionality.
  • Improve the compatibility of data with models.

Roughly, feature selection techniques can be divided into three groups.

  • Filter method.
  • Wrapper method.
  • Embedded method.

Filter Methods

These methods are fast and easy ways to perform feature selection. Here, the features are selected at the data preprocessing stage, before any machine learning algorithm is applied to the data. The drawback of these methods is that they do not remove multicollinearity from the data. The basic architecture of the modelling procedure with these methods of feature selection is as follows.

Some of the techniques under the filter methods are listed below (a small scikit-learn sketch follows this list):

  • Information gain.
  • Chi-square test.
  • Fisher’s Score.
  • Correlation coefficient.
  • Variance threshold.
  • Mean absolute deviation.
  • Dispersion coefficient.
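As a quick illustration (not part of the original notebook), two of these filter techniques can be applied with scikit-learn; the dataset and the threshold value below are only placeholders:

# A minimal sketch of two filter methods using scikit-learn (assumed available);
# the iris dataset and the threshold value are illustrative only.
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold, SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# Variance threshold: drop features whose variance falls below the cut-off.
vt = VarianceThreshold(threshold=0.2)
X_vt = vt.fit_transform(X)

# Chi-square test: keep the k features most dependent on the target.
skb = SelectKBest(score_func=chi2, k=2)
X_chi2 = skb.fit_transform(X, y)

print(X_vt.shape, X_chi2.shape)

Note that these filters look at each feature (or its relationship with the target) in isolation, which is exactly why they cannot handle multicollinearity between input features.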

Wrapper Methods (Greedy Algorithms)

In these methods, the feature selection algorithm trains the model on different subsets of features in an iterative way. The algorithm feeds a set of features into the model and, at each iteration, the number of features is reduced or increased. The stopping criterion of the iteration is to select the subset of features that gives the best result in modelling. So at the end of the wrapper procedure, an optimal set of features is selected for the modelling.

The basic architecture of the technique followed by the wrapper methods is as follows:

Some of the techniques under these methods are:

  • Forward selection.
  • Backward selection.
  • Bi-directional selection.
  • Exhaustive selection.
  • Recursive selection.

Note: the main focus of this article is sequential feature selection, which belongs to the wrapper methods, so the wrapper methods are discussed in more detail later in the article.

Embedded Methods

These methods combine the filter and wrapper approaches: they are as fast as filter methods and nearly as accurate as wrapper methods. Here, feature selection is blended into the learning algorithm itself. They tend to be the most accurate because they overcome the drawbacks of both filter and wrapper methods.

Some of the techniques under this method are:

  • Regularization: Main regularization methods are:
  1. Lasso regularization.
  2. Ridge regularization.
  3. Elastic net regularization.
  • Tree-based methods: Main tree-based methods are:
  1. LightGBM.
  2. XGBoost.

The basic architecture of any procedure containing embedded methods as feature selection techniques is as follows.
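As an illustration (not from the original notebook), an embedded method such as Lasso regularization can be used for feature selection through scikit-learn's SelectFromModel; the dataset and the alpha value below are only placeholders:

# A minimal sketch of an embedded method: Lasso-based feature selection via
# scikit-learn's SelectFromModel. The dataset and alpha are illustrative only.
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)

# Features whose Lasso coefficients shrink to zero are dropped.
selector = SelectFromModel(Lasso(alpha=0.1)).fit(X, y)
print(selector.get_support())          # boolean mask of the kept features
X_selected = selector.transform(X)
print(X.shape, "->", X_selected.shape)

Here the selection happens as a by-product of fitting the regularized model, which is what makes the method "embedded" in the learning algorithm.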

Sequential Feature Selection Algorithms

Sequential feature selection algorithms are part of the wrapper methods: they add or remove features from the candidate set sequentially. A simpler approach evaluates each feature separately and selects the M best of the N features on the basis of their individual scores; this is called naive sequential feature selection. It rarely works well because it does not account for feature dependence.

In the proper technique, the algorithm selects a subset of features, evaluates the model on it, and then iterates over different subsets, reducing or increasing the number of features, until the model reaches optimal performance and results.

Mathematically, these algorithms reduce the initial N features to M features, where M < N, and the M features are chosen to optimize the performance of the model.

The sequential feature selection method has two components:

An objective function:

The method seeks to minimize the number of features in a subset of the full feature set while improving the results. The quantity being optimized is called the criterion: mean squared error is a typical criterion for regression models, and the misclassification rate is a typical criterion for classification models.

A sequential search algorithm:

This search algorithm adds or removes candidate features from the candidate subset while evaluating the objective function (the criterion). Sequential searches move in only one direction: they either keep increasing the number of features in the candidate subset or keep reducing it.

On the basis of movement, we can divide them into two variants.

Sequential Forward Selection (SFS)

In the SFS variant, features are sequentially added to an empty set of features until the addition of an extra feature no longer reduces the criterion.

Mathematically, if the input to the algorithm is the full feature set

Y = {y1, y2, …, yd},

then the output will be a subset

Xk = {xj | j = 1, 2, …, k; xj ∈ Y},

where k is the number of selected features and k < d.

In the initialization, X is the empty set and k = 0 (where k is the size of the subset).

In the termination, the size is k = p where p is the number of desired features.
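To make the greedy procedure concrete, here is a minimal sketch (not from the original notebook) of the forward step using scikit-learn's cross_val_score; the dataset, model and p are placeholders, and the backward variant described next simply mirrors this loop by starting from the full set and removing features:

# A minimal sketch of sequential forward selection (SFS), assuming
# scikit-learn is available; the dataset, model and p are placeholders.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=3)
p = 3                                   # desired number of features
selected, remaining = [], list(range(X.shape[1]))

while len(selected) < p:
    # Try adding each remaining feature and keep the one with the best score.
    scores = {f: cross_val_score(model, X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    best = max(scores, key=scores.get)
    selected.append(best)
    remaining.remove(best)
    print(f"k={len(selected)}, added feature {best}, score={scores[best]:.3f}")

print("Selected feature indices:", selected)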

Sequential Backward Selection (SBS)

This variant starts with the full set of features from the input data and sequentially removes features from the set until removing a further feature would increase the criterion.

Mathematically, if the input data is the full feature set

Y = {y1, y2, …, yd},

the output of the variant will be

Xk = {xj | j = 1, 2, …, k; xj ∈ Y}, with k < d.

In the initialization, X is the full set of features and k = d (where k is the size of the subset).

In the termination, the size is k = p where p is the number of desired features.

There are two more variants of the sequential feature selection.

  • Sequential forward floating selection.
  • Sequential backward floating selection.

These floating variants are extensions of SFS and SBS: they add an extra exclusion or inclusion step, so features can be removed (or re-added) even after they have been included or excluded during the procedure. This allows a larger number of feature subset combinations to be evaluated before the final set is chosen.

Let’s see how we can implement all of these using Python.

MLxtend is a package that provides the implementation of sequential feature selection methods.

You can check the whole code at this link.

Here in the article, I will show only the outputs and explain how the procedure works in the background.

In this procedure, I am using the iris dataset and the feature_selection module provided by the mlxtend library.

In the following code, after defining X, y and the model object, we define a sequential forward selection object for a KNN model.

# Load the iris data and define the KNN model (these were defined earlier in
# the original notebook; they are recreated here so the snippet runs on its own,
# and the n_neighbors value is illustrative).
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=4)

# Sequential forward selection: keep the best 3 features for the KNN model.
sfs1 = SFS(knn,
           k_features=3,
           forward=True,
           floating=False,
           verbose=2,
           scoring='accuracy',
           cv=0)

sfs1 = sfs1.fit(X, y)

Output:

Here in the output, we can see the scores for each number of features selected during the process. In the SFS object we provided our model (KNN) and the number of features to select (k_features=3); forward=True tells the object to follow SFS, and floating=False because this is an example of SFS, not SFFS or SBFS.

Using sfs1.subsets_ we can cross-check the results of every step.

Here in the results, we can see every step along with the names of the features from the dataset.
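For reference, the dictionary can also be inspected directly; the fields used below follow mlxtend's documented subsets_ attribute:

# subsets_ is a dict keyed by the number of selected features; each entry
# stores the chosen feature indices and their (cross-validated) scores.
for k, info in sfs1.subsets_.items():
    print(k, info['feature_idx'], info['avg_score'])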

Just by changing the parameters, we can toggle between all the variants of sequential feature selection (see the snippet after the lists below).

For SBS 

  • forward = False
  • floating = False

For SFFS

  • forward = True
  • floating = True

For SBFS

  • forward = False
  • floating = True
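For example, a backward selection (SBS) and a floating forward selection (SFFS) object can be configured as follows; everything except the two flags mirrors the earlier SFS call:

# Same estimator and settings as before; only forward/floating change.
sbs = SFS(knn, k_features=3, forward=False, floating=False,
          scoring='accuracy', cv=0).fit(X, y)   # SBS

sffs = SFS(knn, k_features=3, forward=True, floating=True,
           scoring='accuracy', cv=0).fit(X, y)  # SFFS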

Printing the results.

We can also save the results in a DataFrame.
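mlxtend provides get_metric_dict() for this; a typical way to turn it into a pandas DataFrame looks like the following:

import pandas as pd

# get_metric_dict() returns one entry per subset size, including the
# cross-validation scores, confidence-interval bound and error estimates.
results_df = pd.DataFrame.from_dict(sfs1.get_metric_dict()).T
print(results_df)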

Results of SFS

Results of SBS

Here we can see that both SFS and SBS found their best result with three features, although the feature selection procedure was different.

In the data frame:

  • cv_scores = cross-validation scores for each subset.
  • ci_bound = confidence interval bound of the cross-validation score.
  • std_dev = standard deviation of the cross-validation scores.
  • std_err = standard error of the cross-validation scores.

Visualizing the standard deviation for the SFS
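mlxtend also ships a plotting helper for this; a minimal call looks like the following (matplotlib is assumed to be installed, and the error band is only meaningful when the selector was fit with cross-validation, i.e. cv greater than 0):

import matplotlib.pyplot as plt
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs

# Plot the average CV score per subset size with a std_dev error band.
plot_sfs(sfs1.get_metric_dict(), kind='std_dev')
plt.title('Sequential Forward Selection (std. dev.)')
plt.grid()
plt.show()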

In this article, we have seen the basics of feature selection and why it is an important part of the modelling procedure. There are various methods that can be used for feature selection, but we prefer sequential feature selection because it gives us full control of the procedure while the modelling is being performed: embedded methods can only be run end to end without control in between, and filter methods are faster but less accurate. The main motive of the article was to make feature selection automatic yet keep it in a controlled form. In the references, I have provided the link to the whole code, so any reader who wants to practise can access the notebook.

References:


Yugesh Verma

Yugesh is a graduate in automobile engineering and worked as a data analyst intern. He completed several Data Science projects. He has a strong interest in Deep Learning and writing blogs on data science and machine learning.