To make accurate predictions in machine learning, we need clean data whose input variables actually influence the output variable. Data cleaning is often carried out alongside exploratory data analysis (EDA), but what if a dataset has too many input variables? Many of them may have no real effect on the output variable yet still degrade the overall result. If the number of such uninformative variables is high, we should remove or compress them before performing further machine learning tasks. Reducing the number of input variables for predictive analysis is called dimensionality reduction.
Feeding fewer input variables into a predictive model often yields a simpler model, and a simpler model frequently generalizes better.
Introduction to SVD
The singular-value decomposition (SVD) is a matrix factorization technique that breaks a matrix into component matrices, which simplifies many calculations and can be used for dimensionality reduction.
Mathematically, the factorization of any m×n matrix into a unitary matrix U (m×m), a rectangular diagonal matrix 𝚺 (m×n), and V* (n×n), the conjugate transpose of a unitary matrix V, is called the singular-value decomposition. It generalizes the eigendecomposition, which applies only to certain square matrices, to matrices of any shape.
Matrix = U.𝚺.V*
Fig. 1: Factorization of a matrix.
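As a minimal sketch of the factorization above, the following uses NumPy's np.linalg.svd on a small matrix. The 4×3 matrix A is a made-up example, not data from this article; the point is only to show the shapes of U, 𝚺 and V* and that their product recovers the original matrix.

```python
import numpy as np

# A small hypothetical 4x3 matrix (m=4, n=3), used only for illustration.
A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 3.0],
              [4.0, 0.0, 1.0],
              [2.0, 2.0, 2.0]])

# full_matrices=True returns U as (m, m) and Vh (that is, V*) as (n, n).
U, s, Vh = np.linalg.svd(A, full_matrices=True)

# NumPy returns the singular values as a 1-D array, so we place them on
# the diagonal of an (m, n) rectangular matrix to obtain Sigma.
Sigma = np.zeros(A.shape)
Sigma[:len(s), :len(s)] = np.diag(s)

# Multiplying the three factors should recover the original matrix.
A_rebuilt = U @ Sigma @ Vh

print(U.shape, Sigma.shape, Vh.shape)
print(np.allclose(A, A_rebuilt))
```

NumPy returns the singular values sorted from largest to smallest, which is what makes truncation (keeping only the first few) meaningful.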
SVD is a popular method for dimensionality reduction, and it works particularly well with sparse data. Here, sparse data refers to data with many zero values. Sparse data arises in many settings, such as a product recommendation system on an e-commerce website, where every user can give a rating or review. Most users leave most products unrated, and those blanks are stored as zeros in the data.
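To make the idea of sparse data concrete, here is a small sketch that builds a made-up user-by-product ratings table in which most entries are zero, then reduces it with scikit-learn's TruncatedSVD. The table size, the 90% share of missing ratings and the choice of five components are all assumptions chosen for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)

# Hypothetical ratings table: 100 users x 40 products, ratings from 1 to 5.
ratings = rng.integers(1, 6, size=(100, 40)).astype(float)

# Assume roughly 90% of the user/product pairs were never rated;
# those entries are stored as zeros.
ratings[rng.random(ratings.shape) < 0.9] = 0.0

# Store the table in compressed sparse row format.
sparse_ratings = csr_matrix(ratings)
density = sparse_ratings.nnz / (ratings.shape[0] * ratings.shape[1])
print(f"fraction of non-zero entries: {density:.2f}")

# TruncatedSVD accepts the sparse matrix directly, without densifying it.
svd = TruncatedSVD(n_components=5, random_state=0)
user_factors = svd.fit_transform(sparse_ratings)
print(user_factors.shape)
```

Each user is now described by five numbers instead of forty mostly-empty columns.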
We can further describe SVD as a projection method: a data matrix with m columns is projected onto a subspace with m or fewer columns that retains most of the structure of the original data. There are several kinds of SVD methods:
- Truncated SVD
- Partial least squares SVD
- Randomized SVD
In this article, we will discuss the truncated SVD and how to use it for dimension reduction.
Truncated Singular Value Decomposition
As discussed above, truncated SVD is a matrix factorization technique, and it is similar to PCA (principal component analysis). The difference is that truncated SVD operates directly on the data matrix, whereas PCA operates on the covariance matrix, which is equivalent to working with mean-centred data.
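The relationship between the two methods can be checked with a short sketch. It assumes the Iris data bundled with scikit-learn and uses the exact "arpack" and "full" solvers so that the comparison is not affected by randomized approximation. If we centre the data ourselves before applying truncated SVD, the result should match PCA up to the sign of each component.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, TruncatedSVD

X = load_iris().data

# PCA centres the data internally before decomposing it.
X_pca = PCA(n_components=2, svd_solver="full").fit_transform(X)

# TruncatedSVD does not centre, so we subtract the column means ourselves.
X_centred = X - X.mean(axis=0)
X_svd = TruncatedSVD(n_components=2, algorithm="arpack").fit_transform(X_centred)

# Component signs are arbitrary, so compare absolute values.
same_after_centring = np.allclose(np.abs(X_pca), np.abs(X_svd), atol=1e-6)

# Without centring, the projection generally differs from PCA.
X_svd_raw = TruncatedSVD(n_components=2, algorithm="arpack").fit_transform(X)
same_without_centring = np.allclose(np.abs(X_pca), np.abs(X_svd_raw), atol=1e-6)

print(same_after_centring, same_without_centring)
```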
Truncated SVD factorizes the data matrix but keeps only the k largest singular values and their corresponding singular vectors, where k is the truncation parameter. The word "truncated" refers to cutting off the smaller singular values, not to truncating the decimal digits of the numbers in the matrix.
Given an m⤫n data matrix with n feature columns, truncated SVD with k components produces a transformed matrix with only k columns, whereas a full SVD retains all n. In other words, the n original features are replaced by the k new features we asked for, each of which is a linear combination of the original features.
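The truncation step can be sketched by hand with NumPy. The random 6×4 matrix and k = 2 below are arbitrary choices for illustration. The sketch keeps the k largest singular values, forms the reduced k-column representation, and checks the known property that the reconstruction error (Frobenius norm) equals the root of the sum of squares of the discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(6, 4))  # hypothetical data: 6 samples, 4 features
k = 2                        # number of components to keep

# Thin SVD: U is (6, 4), s holds 4 singular values (largest first), Vh is (4, 4).
U, s, Vh = np.linalg.svd(X, full_matrices=False)

# Truncate: keep only the k largest singular values and their vectors.
U_k, s_k, Vh_k = U[:, :k], s[:k], Vh[:k, :]

# Reduced representation with k columns (equivalent to X @ Vh_k.T).
X_reduced = U_k * s_k

# Rank-k approximation of X back in the original 4-dimensional space.
X_approx = X_reduced @ Vh_k

# The Frobenius error is determined by the discarded singular values.
error = np.linalg.norm(X - X_approx)
print(X_reduced.shape, X_approx.shape)
print(np.isclose(error, np.sqrt(np.sum(s[k:] ** 2))))
```

The array X_reduced plays the same role as the output of fit_transform in scikit-learn's TruncatedSVD, although the library may use a different (for example randomized) algorithm internally.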
For example, let's perform it in Python with the Iris dataset.
Setting up the environment in Google Colab
Requirements: Python 3.7 or above, scikit-learn 0.24.2.
Importing the libraries.
from sklearn.datasets import load_iris
from sklearn.decomposition import TruncatedSVD
Loading the iris dataset
iris = load_iris()
X = iris.data
Y = iris.target
X[:10]
Applying TruncatedSVD to the Iris data set to reduce it to two columns.
Defining the TruncatedSVD model:
truncatedSVD = TruncatedSVD(n_components=2)
Fitting the data set with TruncatedSVD:
X_truncated = truncatedSVD.fit_transform(X)
X_truncated[:10]
Here we can see that truncated SVD has reduced the Iris data set from four features to two.
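To get a feel for how much information the two components retain, one option is to inspect the attributes of the fitted model. The sketch below refits the model so that it is self-contained; singular_values_ and explained_variance_ratio_ are attributes of scikit-learn's TruncatedSVD, while the variable names and random_state=0 are choices made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import TruncatedSVD

X = load_iris().data

# Fit a two-component truncated SVD; random_state fixes the randomized solver.
svd = TruncatedSVD(n_components=2, random_state=0)
X_2d = svd.fit_transform(X)

# Shape before and after the reduction.
print(X.shape, "->", X_2d.shape)

# Singular values kept by the truncation, largest first.
print(svd.singular_values_)

# Share of the total variance captured by each component, and their sum.
print(svd.explained_variance_ratio_, svd.explained_variance_ratio_.sum())
```

A total close to 1 suggests that little variance was lost by dropping the remaining components.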
In the next step, we will evaluate the truncated SVD with a random forest algorithm for classification.
To perform this, we generate classification data using the scikit-learn library's make_classification function.
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=2000, n_features=30, n_informative=20, n_redundant=10, random_state=7)
print(X.shape, y.shape)
In the above, we have generated data with 30 features and 2000 rows.
In the next step, we will fit the truncated SVD to the data set for dimension reduction and then fit a random forest model to the reduced data. We will evaluate the model using repeated stratified k-fold cross-validation with ten splits and three repeats.
Here I am creating a pipeline in which all the steps of this procedure are defined, so that the SVD is refitted on the training portion of every fold.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
# pipeline: reduce to 20 components, then classify
steps = [('svd', TruncatedSVD(20)), ('m', RandomForestClassifier())]
model = Pipeline(steps=steps)
# 10 splits x 3 repeats = 30 evaluations
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print((np.mean(n_scores), np.std(n_scores)))
n_scores
In the output, we can see the mean and standard deviation of the accuracy, followed by the individual score from each of the 30 evaluations run by the repeated stratified k-fold cross-validation.
This is a good result, but a better approach is to measure the accuracy for different numbers of components and choose the best one. In the next step, we will do this to find a suitable number of components for model fitting.
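To put the score in context, it can help to compare it with a baseline that skips the SVD step. The following is a rough sketch of such a comparison, not part of the original experiment: it regenerates the same synthetic data but, to keep the run short, assumes a single 5-fold split and 50 trees, so the numbers will not match the ones above exactly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=2000, n_features=30, n_informative=20,
                           n_redundant=10, random_state=7)

# Same folds for both models so the comparison is like for like.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Baseline: random forest on all 30 original features.
baseline = RandomForestClassifier(n_estimators=50, random_state=1)

# Reduced: truncated SVD to 20 components, then the same random forest.
reduced = Pipeline(steps=[
    ("svd", TruncatedSVD(n_components=20, random_state=1)),
    ("m", RandomForestClassifier(n_estimators=50, random_state=1)),
])

baseline_scores = cross_val_score(baseline, X, y, scoring="accuracy", cv=cv)
reduced_scores = cross_val_score(reduced, X, y, scoring="accuracy", cv=cv)

print("all 30 features: %.3f" % np.mean(baseline_scores))
print("20 components:   %.3f" % np.mean(reduced_scores))
```

Whether the reduced model wins depends on the data and the estimator, so it is worth running such a check rather than assuming that fewer columns always help.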
from numpy import mean, std
from matplotlib import pyplot
from sklearn.datasets import make_classification
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# defining dataset
X, y = make_classification(n_samples=2000, n_features=30, n_informative=20, n_redundant=10, random_state=7)

# get a list of models, one per number of components (1 to 29)
# note: the final estimator in these pipelines is logistic regression
def get_models():
    models = dict()
    for i in range(1, 30):
        steps = [('svd', TruncatedSVD(n_components=i)), ('m', LogisticRegression())]
        models[str(i)] = Pipeline(steps=steps)
    return models

# evaluate a given model using cross-validation
def evaluate_model(model, X, y):
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
    return scores

# get the models to evaluate
models = get_models()

# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
    scores = evaluate_model(model, X, y)
    results.append(scores)
    names.append(name)
    print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))

# plot model performance for comparison
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.xticks(rotation=45)
pyplot.show()
In the last output, we can see that beyond roughly 20 components the model's accuracy stops improving and the spread of the scores stabilizes. Choosing 20 components is therefore a sensible setting for a model built on this data, which matches the 20 informative features used to generate it. With this procedure, we can choose the number of components for any data set and improve the model's performance by using truncated SVD as a dimensionality reduction method.
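The manual loop over component counts can also be expressed with scikit-learn's GridSearchCV, which tunes a pipeline parameter through the step__parameter naming convention. The sketch below is one possible way to do it rather than the procedure used above: the coarse grid of five values, the 5-fold split and max_iter=1000 are assumptions made to keep the example short and free of convergence warnings.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=2000, n_features=30, n_informative=20,
                           n_redundant=10, random_state=7)

pipeline = Pipeline(steps=[
    ("svd", TruncatedSVD(random_state=1)),
    ("m", LogisticRegression(max_iter=1000)),
])

# "svd__n_components" addresses the n_components parameter of the "svd" step.
param_grid = {"svd__n_components": [5, 10, 15, 20, 25]}

# Cross-validate every candidate and keep the best-scoring setting.
search = GridSearchCV(pipeline, param_grid, scoring="accuracy", cv=5)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

After fitting, search.best_estimator_ holds the pipeline refitted on all the data with the selected number of components.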
This article has discussed how the singular value decomposition works mathematically and how truncated singular value decomposition keeps only the leading components of the data. Truncated SVD differs from PCA in that it does not centre the data before computing the decomposition, which makes it well suited to sparse data.
Yugesh is a graduate in automobile engineering and worked as a data analyst intern. He completed several Data Science projects. He has a strong interest in Deep Learning and writing blogs on data science and machine learning.