Beginner's Guide To Truncated SVD For Dimensionality Reduction

To predict results accurately in machine learning, we need clean data in which the input variables genuinely affect the output variable. There are many methods for cleaning data; more formally, this process is part of EDA (exploratory data analysis). But what if a dataset has too many input variables? Many of them may not affect the output variable, yet they still degrade the overall result. When the number of uninformative variables is high, we should drop them from the dataset so that further machine learning tasks can be performed more accurately. Reducing the number of input variables for predictive modelling is called dimensionality reduction.

As this suggests, feeding fewer input variables into a predictive model is often fruitful: it yields a simpler model that can perform better.

 Introduction to SVD 

Singular value decomposition (SVD) is a matrix factorization technique that breaks a matrix into its component matrices, simplifying many calculations and underpinning several dimensionality reduction methods.

Mathematically, the factorization of any m×n matrix A into a unitary matrix U (m×m), a rectangular diagonal matrix 𝚺 (m×n) whose diagonal entries are the singular values, and the conjugate transpose V* of an n×n unitary matrix V is called the singular value decomposition:

A = U 𝚺 V*

Fig 1. Factorization of a matrix into U, 𝚺 and V*
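To make the factorization concrete, here is a minimal NumPy sketch (the matrix values are arbitrary, chosen only for illustration) that computes U, 𝚺 and V* for a small matrix and checks that their product reconstructs it:

 import numpy as np
 # an arbitrary 4x3 matrix (m = 4, n = 3), values chosen only for illustration
 A = np.array([[1., 2., 3.],
               [4., 5., 6.],
               [7., 8., 9.],
               [10., 11., 12.]])
 # full_matrices=True returns U (4x4), the singular values s, and V* (3x3)
 U, s, Vt = np.linalg.svd(A, full_matrices=True)
 # build the rectangular diagonal matrix Sigma (4x3) from the singular values
 Sigma = np.zeros(A.shape)
 Sigma[:len(s), :len(s)] = np.diag(s)
 # U . Sigma . V* reconstructs A up to floating-point error
 print(np.allclose(A, U @ Sigma @ Vt))  # True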

SVD is a popular basis for dimensionality reduction, and it works particularly well with sparse data, i.e. data containing many zero values. Sparse data arises in many settings: in a product recommendation system on an e-commerce website, for example, every user can give a rating or review, but most users leave most products unrated, which produces many zero entries in the data.

Other common sources of sparse data include bag-of-words representations of text and one-hot encoded categorical variables.
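A practical point worth noting here: scikit-learn's TruncatedSVD (used throughout this article) accepts scipy.sparse matrices directly, with no densifying step, unlike PCA, which has traditionally required a dense array. A minimal sketch, using a randomly generated sparse matrix as a stand-in for something like a rating matrix:

 from scipy.sparse import random as sparse_random
 from sklearn.decomposition import TruncatedSVD
 # a 100x50 matrix in which about 99% of the entries are zero
 X_sparse = sparse_random(100, 50, density=0.01, random_state=42)
 # TruncatedSVD works on the sparse matrix as-is
 svd = TruncatedSVD(n_components=5, random_state=42)
 X_reduced = svd.fit_transform(X_sparse)
 print(X_reduced.shape)  # (100, 5)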

SVD can also be framed as a projection method, where data with many columns is projected onto a lower-dimensional subspace that preserves most of its structure. Several SVD-based methods are in common use:

  • Truncated SVD
  • Partial least square SVD
  • Randomized SVD

In this article, we will discuss the truncated SVD and how to use it for dimension reduction.

Truncated Singular Value Decomposition 

As discussed above, truncated SVD is a matrix factorization technique similar to PCA (principal component analysis). The difference is that truncated SVD (like any SVD) operates on the data matrix directly, whereas PCA operates on the covariance matrix of the data.

Truncated SVD factorizes the data matrix but keeps only a specified number of columns, the truncation. The name borrows from numerical truncation, where digits beyond a given place are simply dropped: 2.498 truncated to one decimal place becomes 2.4 (whereas rounding would give 2.5). Analogously, truncated SVD drops the smallest singular values and keeps only the largest ones.

For a given m×n data matrix, truncated SVD produces factors with only the specified number of columns, whereas a full SVD keeps all of them. In effect, it drops every component beyond the number you provide.
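A rough sketch of the idea in plain NumPy (a random matrix, purely for illustration): keep only the k largest singular values, together with the matching columns of U and rows of V*:

 import numpy as np
 rng = np.random.default_rng(0)
 A = rng.normal(size=(6, 4))  # an arbitrary 6x4 matrix
 U, s, Vt = np.linalg.svd(A, full_matrices=False)
 k = 2  # keep only the 2 largest singular values
 # rank-k approximation of A from the truncated factors
 A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
 # the reduced representation of the rows: 6 samples, k columns
 X_reduced = U[:, :k] * s[:k]
 print(X_reduced.shape)  # (6, 2)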

Now let's perform truncated SVD in Python with the Iris dataset.

Setting up the environment in Google Colab.

Requirements: Python 3.7 or above, scikit-learn 0.24.2.

Importing the libraries.  

Input:

 from sklearn.datasets import load_iris
 from sklearn.decomposition import TruncatedSVD 

Loading the Iris dataset.

Input:

 iris = load_iris()
 X = iris.data
 Y = iris.target
 X[:10] 

Output:

Applying TruncatedSVD to the Iris dataset with two components.

Defining the TruncatedSVD model.

Input:

truncatedSVD = TruncatedSVD(n_components=2)

Fitting the dataset and transforming it with TruncatedSVD:

Input:

 X_truncated = truncatedSVD.fit_transform(X)
 X_truncated[:10] 

Output:

Here we can see that we have reduced the dimensionality of the Iris dataset from four features to two using truncated SVD.
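As a quick sanity check (not part of the original walkthrough), the fitted model's explained_variance_ratio_ attribute reports how much of the variance in the Iris data the two components retain:

 # fraction of the variance captured by each of the two components
 print(truncatedSVD.explained_variance_ratio_)
 print(truncatedSVD.explained_variance_ratio_.sum())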

In the next step, we will evaluate the truncated SVD with a random forest algorithm for classification.

To perform this, we generate classification data using scikit-learn's make_classification function.

Input:

 from sklearn.datasets import make_classification
 X, y = make_classification(n_samples=2000, n_features=30, n_informative=20, n_redundant=10, random_state=7)
 print(X.shape, y.shape) 

Output:

In the above, we have generated data with 30 features and 2000 rows.

In the next step, we fit truncated SVD to the dataset for dimensionality reduction, and then fit the reduced data to a random forest model. We cross-validate the whole pipeline with ten splits and three repeats.

Here I am creating a pipeline in which all the steps of this procedure are defined.

Input:

 import numpy as np
 from sklearn.model_selection import cross_val_score
 from sklearn.model_selection import RepeatedStratifiedKFold
 from sklearn.pipeline import Pipeline
 from sklearn.decomposition import TruncatedSVD
 from sklearn.ensemble import RandomForestClassifier
 # reduce the 30 features to 20 components, then classify with a random forest
 steps = [('svd', TruncatedSVD(n_components=20)), ('m', RandomForestClassifier())]
 model = Pipeline(steps=steps)
 # 10-fold stratified cross-validation, repeated 3 times
 cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
 n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
 print((np.mean(n_scores), np.std(n_scores)))
 n_scores 

Output:

Here in the output, we can see the mean and standard deviation of the accuracy across all the evaluations performed by repeated stratified k-fold cross-validation.

This is a good result, but a better approach is to measure the accuracy for different numbers of components and choose the best among them. In the next step, we do exactly that to find how many components the model actually needs.

Input:

 from numpy import mean, std
 from matplotlib import pyplot
 # defining the dataset
 X, y = make_classification(n_samples=2000, n_features=30, n_informative=20, n_redundant=10, random_state=7)
 # get a dict of pipelines, one for each number of components
 def get_models():
   models = dict()
   for i in range(1, 30):
     steps = [('svd', TruncatedSVD(n_components=i)), ('m', RandomForestClassifier())]
     models[str(i)] = Pipeline(steps=steps)
   return models
 # evaluate a given model using repeated stratified k-fold cross-validation
 def evaluate_model(model, X, y):
   cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
   scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
   return scores
 # get the models to evaluate
 models = get_models()
 # evaluate the models and store the results
 results, names = list(), list()
 for name, model in models.items():
   scores = evaluate_model(model, X, y)
   results.append(scores)
   names.append(name)
   print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))
 # plot model performance for comparison
 pyplot.boxplot(results, labels=names, showmeans=True)
 pyplot.xticks(rotation=45)
 pyplot.show() 

Output:

In the last output, we can see that the random forest stops improving in accuracy after roughly the 20th component, which matches the 20 informative features used to generate the data. So choosing 20 components would be a sensible choice when modelling this data. With this procedure, we can easily find a good number of components for any dataset and improve a model's performance by using truncated SVD as a dimensionality reduction method.

Conclusion

This article has discussed how singular value decomposition works mathematically and how truncated singular value decomposition works under the hood to find the principal components. Truncated SVD also differs from PCA: it does not centre the data before computing the decomposition, which is what makes it so fruitful to use with sparse data.
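To illustrate that centring point, here is a small sketch (my own check, not from the walkthrough above): once the data is centred by hand, TruncatedSVD and PCA agree on the Iris data:

 import numpy as np
 from sklearn.datasets import load_iris
 from sklearn.decomposition import PCA, TruncatedSVD
 X = load_iris().data
 X_centered = X - X.mean(axis=0)  # remove the column means by hand
 pca = PCA(n_components=2).fit(X)  # PCA centres internally
 svd = TruncatedSVD(n_components=2, random_state=42).fit(X_centered)
 # with centred input, the explained-variance ratios essentially match;
 # on the raw, uncentred data the two methods would differ
 print(pca.explained_variance_ratio_)
 print(svd.explained_variance_ratio_)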
