# Principal Component Analysis On Matrix Using Python

Machine learning algorithms can take a long time to train on large, high-dimensional datasets. Dimensionality reduction techniques were introduced to address this. When the input dimension is high, Principal Component Analysis (PCA) can be used to speed up training. It is a projection method that maps the data onto fewer dimensions while retaining most of the variance of the original data.

In this article, we will cover the basics of Principal Component Analysis (PCA) on matrices with an implementation in Python. We then apply the technique as a preprocessing step for a classification model.

### Dataset

The dataset can be downloaded from the following link. The dataset gives the details of breast cancer patients. It has 32 features with 569 rows.

Let’s get started. Import all the libraries required for this project.

```python
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
%matplotlib inline
```

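```python
dataset = pd.read_csv('cancerdataset.csv')
# Encode the diagnosis label: malignant (M) as 1, benign (B) as 0
dataset["diagnosis"] = dataset["diagnosis"].map({'M': 1, 'B': 0})
# Drop the last column of the file
data = dataset.iloc[:, 0:-1]
```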

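We store the independent variables (the features) in `X` and the dependent variable (the diagnosis) in `y` using the `iloc` method.

```python
X = data.iloc[:, 2:].values   # every column after the diagnosis column
y = data.iloc[:, 1].values    # the diagnosis column
```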

Split the training and testing data in the 80:20 ratio.

```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
```

### PCA Standardization

PCA can only be applied to numerical data, so it is important to convert all the data into a numerical format. Features measured in different units must also be standardized to a common scale (zero mean, unit variance); otherwise, features with large ranges dominate the principal components.

```python
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
# Fit the scaler on the training data only, then reuse it for the test data
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
```
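To see what standardization does to the numbers, here is a minimal sketch on a small made-up array. The values and the names `toy` and `toy_scaled` are illustrative assumptions, not part of the cancer dataset. It applies the z-score formula by hand, which is the calculation `StandardScaler` performs with its default settings.

```python
import numpy as np

# Hypothetical toy data: two features on very different scales
toy = np.array([[1.0, 100.0],
                [2.0, 300.0],
                [3.0, 500.0],
                [4.0, 700.0]])

# z-score: subtract each column's mean, divide by its population standard deviation
col_mean = toy.mean(axis=0)
col_std = toy.std(axis=0)  # ddof=0, the same convention StandardScaler uses
toy_scaled = (toy - col_mean) / col_std

print(toy_scaled.mean(axis=0))  # approximately 0 for each column
print(toy_scaled.std(axis=0))   # approximately 1 for each column
```

After scaling, both columns carry the same weight even though the second one was originally hundreds of times larger.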

### Covariance Matrix

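Using the standardized data, we build the covariance matrix. It gives the covariance between each pair of features in the dataset, with the variance of each feature on the diagonal. A negative value in the result below indicates that two features are inversely related: as one increases, the other tends to decrease.

```python
# Covariance matrix of the training data
mean_vec = np.mean(X_train, axis=0)
cov_mat = (X_train - mean_vec).T.dot(X_train - mean_vec) / (X_train.shape[0] - 1)
# Covariance matrix of the test data, centered on its own mean
mean_vect = np.mean(X_test, axis=0)
cov_matt = (X_test - mean_vect).T.dot(X_test - mean_vect) / (X_test.shape[0] - 1)
print(cov_mat)
```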
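As a sanity check on the formula, the sketch below computes a covariance matrix by hand on randomly generated data and compares it with `np.cov`. The array `sample` and its size are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(size=(50, 3))  # hypothetical data: 50 rows, 3 features

# Manual covariance: center each column, then divide X^T X by (n - 1)
n_rows = sample.shape[0]
centered = sample - sample.mean(axis=0)
cov_manual = centered.T.dot(centered) / (n_rows - 1)

# np.cov expects one variable per row, hence the transpose
cov_numpy = np.cov(sample.T)

print(cov_manual.shape)
print(np.allclose(cov_manual, cov_numpy))
```

Note that the divisor is the number of rows minus one, `sample.shape[0] - 1`, not the whole `shape` tuple. Because `StandardScaler` divides by the population standard deviation, the diagonal of a covariance matrix built this way from scaled data is n/(n-1) rather than exactly 1.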

### Eigen Decomposition on Covariance Matrix

Each eigenvector of the covariance matrix has a corresponding eigenvalue, and the sum of the eigenvalues equals the total variance in the dataset. The eigenvector with the largest eigenvalue gives the direction of maximum variance. The eigenvectors with the lowest eigenvalues capture the least variation in the dataset, so these are the ones to drop.

```python
# Eigen decomposition of the training covariance matrix
cov_mat = np.cov(X_train.T)
eig_vals, eig_vecs = np.linalg.eig(cov_mat)
# Same for the test data, stored under separate names
cov_matt = np.cov(X_test.T)
eig_vals_test, eig_vecs_test = np.linalg.eig(cov_matt)
print(eig_vals)
print(eig_vecs)
```
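To decide which eigenvectors to keep, it helps to sort the eigenvalues and express each one as a share of the total variance. The following is a minimal sketch on synthetic data; the array `sample` is made up for illustration, and `np.linalg.eigh` is used here as an alternative to `np.linalg.eig` because a covariance matrix is symmetric.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: the second feature is almost a multiple of the first
base = rng.normal(size=(200, 1))
sample = np.hstack([base,
                    2.0 * base + 0.1 * rng.normal(size=(200, 1)),
                    rng.normal(size=(200, 1))])

cov = np.cov(sample.T)
# eigh returns real eigenvalues in ascending order for symmetric matrices
vals, vecs = np.linalg.eigh(cov)

# Reorder so the largest eigenvalue (most variance) comes first
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]

explained_ratio = vals / vals.sum()
print(explained_ratio)             # share of variance per component
print(np.cumsum(explained_ratio))  # running total, useful for choosing a cutoff
```

A common rule of thumb is to keep the smallest number of components whose cumulative share passes a chosen threshold, and drop the rest.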

We need to specify how many components we want to keep. With `n_components = 2`, the feature matrix is reduced to 2 columns. The first and second principal components capture the most variance in the original dataset.

```python
from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
# Fit on the training data only, then project the test data with the same components
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
X_train.shape
```

```python
pca.components_
```

In this array, each row represents a principal component, and each column corresponds to a feature of the original data.
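The link between the eigen decomposition and scikit-learn's `PCA` can be checked directly. The sketch below, on synthetic data with made-up names (`sample`, `pca_demo`, `top2`), projects the centered data onto the two leading eigenvectors of the covariance matrix and compares the result with `PCA.transform`. The comparison uses absolute values because the sign of each component is arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical data: 100 rows, 4 correlated features
sample = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))

pca_demo = PCA(n_components=2).fit(sample)
print(pca_demo.components_.shape)  # (2, 4): 2 components x 4 original features

# Manual route: eigenvectors of the covariance matrix, largest eigenvalues first
centered = sample - sample.mean(axis=0)
vals, vecs = np.linalg.eigh(np.cov(centered.T))
order = np.argsort(vals)[::-1]
top2 = vecs[:, order[:2]]  # columns are the two leading eigenvectors

manual_proj = centered @ top2
sklearn_proj = pca_demo.transform(sample)

# Each component may differ by a sign flip, so compare magnitudes
print(np.allclose(np.abs(manual_proj), np.abs(sklearn_proj)))
```

The rows of `components_` are the same directions as the columns of `top2`, which is why `components_` has one row per principal component.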

### Fitting a Decision Tree Classifier to the training set

As we are solving a classification problem, we can use the Decision Tree Classifier for model prediction.

```python
from sklearn.tree import DecisionTreeClassifier
# Create Decision Tree classifier object
clf = DecisionTreeClassifier()
# Train Decision Tree Classifier
clf = clf.fit(X_train, y_train)
# Predict the response for test dataset
y_pred = clf.predict(X_test)
```

### Evaluating the Algorithm

For classification tasks, we will use a confusion matrix to check the accuracy of our machine learning model.

```python
from sklearn.metrics import confusion_matrix
confusion = pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'], margins=True)
confusion
```
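Accuracy can be read off a confusion matrix as the sum of the diagonal divided by the total count. Here is a small sketch using made-up labels; the arrays `actual` and `predicted` are hypothetical and are not results from the cancer dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels for illustration only (1 = malignant, 0 = benign)
actual = np.array([0, 0, 1, 1, 0, 1, 0, 1])
predicted = np.array([0, 1, 1, 1, 0, 0, 0, 1])

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(actual, predicted)
print(cm)

# Correct predictions sit on the diagonal
accuracy = np.trace(cm) / cm.sum()
print(accuracy)                            # 0.75
print(accuracy_score(actual, predicted))   # same value from scikit-learn
```

The off-diagonal cells separate the two kinds of error: benign cases predicted as malignant, and malignant cases predicted as benign.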

### Plot the training set

```python
from matplotlib.colors import ListedColormap
X1, y1 = X_train, y_train
# Grid covering the range of both principal components
a, b = np.meshgrid(np.arange(start = X1[:, 0].min() - 1,
                             stop = X1[:, 0].max() + 1, step = 0.01),
                   np.arange(start = X1[:, 1].min() - 1,
                             stop = X1[:, 1].max() + 1, step = 0.01))
# Shade the decision regions; two light colors so both regions are visible
plt.contourf(a, b, clf.predict(np.array([a.ravel(),
             b.ravel()]).T).reshape(a.shape), alpha = 0.75,
             cmap = ListedColormap(('mistyrose', 'lightblue')))
plt.xlim(a.min(), a.max())
plt.ylim(b.min(), b.max())
# Plot the training points, one color per class
for i, j in enumerate(np.unique(y1)):
    plt.scatter(X1[y1 == j, 0], X1[y1 == j, 1],
                color = ListedColormap(('red', 'blue'))(i), label = j)
plt.title('Decision Tree')
plt.xlabel('PC1') # for Xlabel
plt.ylabel('PC2') # for Ylabel
plt.legend() # to show legend
# show scatter plot
plt.show()
```

### Final Thoughts

In this article, we discussed how PCA is used for dimensionality reduction of large datasets. We also explored the covariance matrix and eigen decomposition, which are used to calculate the principal components. Hope this article is useful to you.
