Practical Approach to Dimensionality Reduction Using PCA, LDA and Kernel PCA
Dimensionality reduction is an important approach in machine learning: a large number of features in a dataset may cause the learning model to overfit. To identify the most significant features and reduce the dimensionality of the dataset, three popular dimensionality reduction techniques are commonly used. In this article, we discuss the practical implementation of these three techniques:
Principal Component Analysis (PCA)
Linear Discriminant Analysis (LDA), and
Kernel PCA (KPCA)
Dimensionality Reduction Techniques
Principal Component Analysis
Principal Component Analysis (PCA) is the most widely used linear technique for dimensionality reduction. It performs a linear mapping of the data from a higher-dimensional space to a lower-dimensional space in such a way that the variance of the data in the low-dimensional representation is maximized.
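To make the idea of a variance-maximizing linear mapping concrete, here is a minimal NumPy sketch of PCA from scratch; the random matrix and the choice of two components are illustrative assumptions, not part of the wine dataset used below.

# A from-scratch PCA sketch (illustrative only; any numeric feature matrix will do)
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))              # hypothetical data: 100 samples, 5 features

X_centered = X - X.mean(axis=0)            # center each feature
cov = np.cov(X_centered, rowvar=False)     # covariance matrix of the features
eigvals, eigvecs = np.linalg.eigh(cov)     # eigendecomposition (ascending eigenvalues)
order = np.argsort(eigvals)[::-1]          # sort directions by decreasing variance
components = eigvecs[:, order[:2]]         # keep the top two principal directions
X_reduced = X_centered @ components        # project onto the principal components
print(eigvals[order][:2] / eigvals.sum())  # fraction of total variance each one explains

scikit-learn's PCA class, used in the walkthrough below, performs the same projection (computed via a singular value decomposition) and exposes the retained variance fractions as explained_variance_ratio_.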
Practical Implementation of PCA
In this implementation, we use the wine classification dataset, which is publicly available on Kaggle. Follow the steps below:
#1. Import the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

#2. Import the dataset
dataset = pd.read_csv('Wine.csv')
X = dataset.iloc[:, 0:13].values
y = dataset.iloc[:, 13].values

#3. Split the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
#4. Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

#5. Apply PCA
from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
explained_variance = pca.explained_variance_ratio_

#6. Fit the Logistic Regression to the Training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)
#7. Predict the Test set results
y_pred = classifier.predict(X_test)
#8. Make the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
#9. Visualize the Training set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green', 'blue')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green', 'blue'))(i), label = j)
plt.title('Logistic Regression (Training set)')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend()
plt.show()
#10. Visualize the Test set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green', 'blue')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green', 'blue'))(i), label = j)
plt.title('Logistic Regression (Test set)')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend()
plt.show()
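As a quick optional check after these steps, you can print how much variance the two principal components retain and the test-set accuracy implied by the confusion matrix; this assumes the variables explained_variance and cm from the steps above are still in memory.

# Optional check (assumes explained_variance and cm from the steps above are defined)
print('Explained variance ratio of the two components:', explained_variance)
print('Test set accuracy:', np.trace(cm) / np.sum(cm))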
Linear Discriminant Analysis
Linear Discriminant Analysis (LDA) is used to find a linear combination of features that characterizes or separates two or more classes of objects or events. It explicitly attempts to model the difference between the classes of data. It works when the measurements made on independent variables for each observation are continuous quantities. When dealing with categorical independent variables, the equivalent technique is discriminant correspondence analysis.
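A key practical point is that, unlike PCA, LDA is supervised: it needs the class labels when it is fitted, and it can produce at most one fewer discriminant axes than there are classes. The short sketch below illustrates this on a hypothetical three-class toy dataset; the generated data and its parameters are assumptions for illustration only, not the wine dataset used next.

# Illustrative sketch: LDA is supervised and limited to n_classes - 1 components
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X_toy, y_toy = make_classification(n_samples=300, n_features=10, n_informative=5,
                                   n_classes=3, random_state=0)
lda_toy = LinearDiscriminantAnalysis(n_components=2)   # 3 classes -> at most 2 axes
X_proj = lda_toy.fit_transform(X_toy, y_toy)           # note: the labels are required
print(X_proj.shape)                                    # (300, 2)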
Practical Implementation of LDA
In this implementation, we use the same wine classification dataset, which is publicly available on Kaggle. Follow the steps below:
#1. Import the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
#2. Import the dataset
dataset = pd.read_csv('Wine.csv')
X = dataset.iloc[:, 0:13].values
y = dataset.iloc[:, 13].values
#3. Split the dataset into Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
#4. Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
#5. Apply LDA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
lda = LDA(n_components = 2)
X_train = lda.fit_transform(X_train, y_train)
X_test = lda.transform(X_test)
#6. Fit Logistic Regression to the Training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)
#7. Predict the Test set results
y_pred = classifier.predict(X_test)
#8. Make the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
#9. Visualize the Training set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green', 'blue')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green', 'blue'))(i), label = j)
plt.title('Logistic Regression (Training set)')
plt.xlabel('LD1')
plt.ylabel('LD2')
plt.legend()
plt.show()
#10. Visualize the Test set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green', 'blue')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green', 'blue'))(i), label = j)
plt.title('Logistic Regression (Test set)')
plt.xlabel('LD1')
plt.ylabel('LD2')
plt.legend()
plt.show()
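As with PCA, you can optionally inspect how much of the between-class variance each discriminant captures and the resulting test accuracy; this assumes the lda and cm objects from the steps above are still in scope.

# Optional check (assumes lda and cm from the steps above are defined)
print('Explained variance ratio of the discriminants:', lda.explained_variance_ratio_)
print('Test set accuracy:', np.trace(cm) / np.sum(cm))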
Kernel Principal Component Analysis
Kernel Principal Component Analysis (KPCA) is an extension of PCA to non-linear problems by means of the kernel trick. It can construct non-linear mappings that maximize the variance of the data in the transformed feature space.
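A classic way to see what the kernel trick buys is a dataset that no linear projection can separate, such as two concentric circles. The sketch below, built on scikit-learn's make_circles with assumed parameters, is an illustration only and is separate from the walkthrough that follows.

# Illustrative sketch: RBF Kernel PCA can unfold data that linear PCA cannot
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

X_circ, y_circ = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_lin = PCA(n_components=2).fit_transform(X_circ)      # the circles remain entangled
X_rbf = KernelPCA(n_components=2, kernel='rbf', gamma=10).fit_transform(X_circ)
# After the RBF mapping, the two circles become (nearly) linearly separable
# along the first kernel principal component.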
Practical Implementation of Kernel PCA
In this practical implementation of Kernel PCA, we use the Social Network Ads dataset, which is publicly available on Kaggle. Follow the steps below:
#1. Import the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
#2. Import the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
#3. Split the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
#4. Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
#5. Apply Kernel PCA
from sklearn.decomposition import KernelPCA
kpca = KernelPCA(n_components = 2, kernel = 'rbf')
X_train = kpca.fit_transform(X_train)
X_test = kpca.transform(X_test)
#6. Fit Logistic Regression to the Training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)
#7. Predict the Test set results
y_pred = classifier.predict(X_test)
#8. Make the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
#9. Visualize the Training set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Logistic Regression (Training set)')
# The plotted axes are the two kernel principal components produced by KPCA,
# not the original Age and Estimated Salary features.
plt.xlabel('KPC1')
plt.ylabel('KPC2')
plt.legend()
plt.show()
#10. Visualize the Test set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Logistic Regression (Test set)')
plt.xlabel('KPC1')
plt.ylabel('KPC2')
plt.legend()
plt.show()
Comparison of PCA, LDA and Kernel PCA
All three dimensionality reduction techniques seek a lower-dimensional representation that preserves the important structure in the data, but each has its own characteristics and way of working.
The difference in Strategy:
PCA and LDA are applied for dimensionality reduction when the problem at hand is linear, that is, when a linear projection of the input features preserves the structure that matters for prediction. Kernel PCA, on the other hand, is applied when the problem is non-linear and no linear projection can capture that structure. This is why PCA and LDA were applied to the same dataset above, so their results can be compared directly, while Kernel PCA was demonstrated on a different dataset and its result is not directly comparable to the other two. Although PCA and LDA both address linear problems, they differ in an important way: LDA is supervised and explicitly models the differences between the classes of the data, whereas PCA is unsupervised and does not use the class labels at all. PCA builds its feature combinations from the eigenvalues and eigenvectors of the covariance matrix so as to capture the overall variance in the data, while LDA builds combinations that maximize the separation between the classes.
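This difference in strategy is visible in the fit calls themselves: PCA and Kernel PCA are unsupervised and see only the features, while LDA also needs the labels. A minimal sketch, assuming X_train and y_train as prepared in the walkthroughs above:

# PCA and Kernel PCA are unsupervised; LDA is supervised
from sklearn.decomposition import PCA, KernelPCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

X_pca  = PCA(n_components = 2).fit_transform(X_train)                        # features only
X_kpca = KernelPCA(n_components = 2, kernel = 'rbf').fit_transform(X_train)  # features only
X_lda  = LDA(n_components = 2).fit_transform(X_train, y_train)               # features and labels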
The difference in Results:
As seen in the practical implementations above, the classification results of the logistic regression model after PCA and after LDA are almost identical. The main reason is that the same dataset was used in both implementations: the relationship between the input and output variables is close to linear, and the task in both cases was simply to reduce the number of input features. The two techniques serve a similar purpose but follow different strategies and algorithms. Kernel PCA, in contrast, was demonstrated on a different dataset because it is intended for cases where the relationship between input and output variables is non-linear, so the classification results of the logistic regression model with Kernel PCA differ from those obtained with PCA and LDA.
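To put the three pipelines on the same footing, each confusion matrix can be reduced to a single accuracy score. The snippet below assumes you have kept the three matrices from the walkthroughs under separate names (cm_pca, cm_lda and cm_kpca are hypothetical names used here for illustration only).

# Summarize each confusion matrix as an accuracy score
import numpy as np

def accuracy_from_cm(cm):
    return np.trace(cm) / np.sum(cm)   # correct predictions divided by all predictions

# Hypothetical variable names; rename to match how you stored the matrices:
# print('PCA  + Logistic Regression:', accuracy_from_cm(cm_pca))
# print('LDA  + Logistic Regression:', accuracy_from_cm(cm_lda))
# print('KPCA + Logistic Regression:', accuracy_from_cm(cm_kpca))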
Dr. Vaibhav Kumar is a seasoned data science professional with extensive experience in machine learning and deep learning. He has a strong research background, having published several papers in reputed international journals and presented at reputed international conferences. He has worked across industry and academia and has led many research and development projects in AI and machine learning. Along with his current role, he has been associated with many reputed research labs and universities, where he contributes as a visiting researcher and professor.