Hands-On Tutorial On Principal Component Analysis In Python

Real-world data is rarely as simple as predicting a person’s salary from their experience alone. Many factors can affect an employee’s salary, and modelling a dependent variable against a large number of independent variables reduces the probability of getting a correct prediction. That is why it is important to identify strong independent variables. Dimensionality Reduction is a technique that lets us examine the independent variables and their variance, helping us identify the minimum number of independent variables that capture the highest variance with respect to the dependent variable.

In simple terms, dimensionality reduction techniques help us reduce the number of independent variables in a problem by identifying new, more effective ones.

Implementing Principal Component Analysis In Python

In this simple tutorial, we will learn how to implement a dimensionality reduction technique called Principal Component Analysis (PCA), which reduces the number of independent variables in a problem by identifying Principal Components. We will take a step-by-step approach to PCA.

Scaling The Data

Before jumping in to identify the strongest factors in a dataset, whatever it may be, we must make sure all the data are on the same scale. If the data is not properly scaled, predictions will be skewed and inaccurate, since features with larger values will have a disproportionately large effect.

```python
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
```
We have used the StandardScaler class of the sklearn.preprocessing package to scale the dataset.

• X_train : training set containing only the independent features
• X_test : test set containing only the independent features

Applying PCA to understand the Independent Factors

After the data is properly scaled, we can apply dimensionality reduction to identify a set of new, strong features, or Principal Components.

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=None)
pca.fit(X_train)
variance = pca.explained_variance_ratio_
```

• n_components: number of principal components to identify

Here we have used the sklearn.decomposition module to import the PCA class. We then initialised the PCA class with n_components=None, as we have no prior knowledge of the variance of the factors. The PCA object is then fitted to the independent variable set to calculate the variance. The explained_variance_ratio_ attribute of the PCA object returns a numpy array containing the variances of the Principal Components, sorted in descending order. (The number of Principal Components will be the same as the number of factors in X_train.) A higher value in the numpy array denotes higher variance.

From the obtained variances, choose the minimum number of principal components with the highest variances.
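A common way to make that choice is to look at the cumulative sum of the variances and keep the smallest number of components that reaches a target threshold. The sketch below assumes a hypothetical variance array of the kind explained_variance_ratio_ returns, and a 90% threshold chosen purely for illustration:

```python
import numpy as np

# Hypothetical variances, as returned by pca.explained_variance_ratio_
# (already sorted in descending order)
variance = np.array([0.45, 0.25, 0.15, 0.10, 0.05])

# Cumulative sum shows how much total variance the first k components retain
cumulative = np.cumsum(variance)  # [0.45, 0.70, 0.85, 0.95, 1.00]

# Smallest number of components retaining at least 90% of the variance
k = int(np.argmax(cumulative >= 0.90)) + 1
print(k)  # 4, since 0.45 + 0.25 + 0.15 + 0.10 = 0.95 >= 0.90
```

The threshold is a judgment call for the problem at hand; 90-95% is a common rule of thumb.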

Applying PCA And Transforming The Datasets

After obtaining the minimum number of features (Principal Components with high variance), reinitialise the PCA with n_components set to that number of Principal Components, then transform the training set and test set.

```python
pca = PCA(n_components=k)  # k = chosen number of Principal Components
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
explained_variance = pca.explained_variance_ratio_
```

The above code will transform X_train and X_test into feature sets containing only the specified number of principal components.

Using Principal Components For Prediction

X_train and X_test can now be fitted to any predictive model, depending on the nature of the problem.

Example:

```python
from sklearn.linear_model import LogisticRegression

# Fitting Logistic Regression to the Training set
classifier = LogisticRegression()
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)
```

• y_train : training set containing only the dependent factor

Important things to note:

• PCA takes all the original training-set variables and decomposes them into a new set of variables with high explained variance.
• Principal component analysis involves extracting linear composites of observed variables.
• PCA can be used to determine how much of the variability the independent variables can explain for the dependent variable; it cannot be used to see which independent variables are more important for prediction.
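Putting the steps above together, here is a minimal end-to-end sketch on synthetic data. The dataset, split ratio, and 90% variance threshold are all illustrative assumptions, not part of the original tutorial; note that scikit-learn's PCA also accepts a float between 0 and 1 for n_components, keeping just enough components to explain that fraction of variance:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for a real dataset (hypothetical values)
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=42)

# Step 1: scale the independent features
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Step 2: keep enough components to explain 90% of the variance
pca = PCA(n_components=0.90)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
print(pca.explained_variance_ratio_)

# Step 3: fit a classifier on the principal components
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
print(classifier.score(X_test, y_test))
```

The same pattern applies with any other estimator in place of LogisticRegression.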

A Computer Science Engineer turned Data Scientist who is passionate about AI and all related technologies. Contact: amal.nair@analyticsindiamag.com
